{"title":"A Joint Matrix Factorization Approach to Unsupervised Action Categorization","authors":"Peng Cui, Fei Wang, Lifeng Sun, Shiqiang Yang","doi":"10.1109/ICDM.2008.59","DOIUrl":"https://doi.org/10.1109/ICDM.2008.59","url":null,"abstract":"In this paper, a novel unsupervised approach to mining categories from action video sequences is presented. This approach consists of two modules: action representation and a learning model. Videos are regarded as spatially distributed dynamic pixel time series, which are quantized into pixel prototypes. After replacing the pixel time series with their corresponding prototype labels, the video sequences are compressed into 2D action matrices. We put these matrices together to form a multi-action tensor, and propose a joint matrix factorization method to simultaneously cluster the pixel prototypes into pixel signatures and the matrices into action classes. The approach is tested on the public and popular Weizmann dataset, and promising results are achieved.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"8 4-5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115322862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Start Globally, Optimize Locally, Predict Globally: Improving Performance on Imbalanced Data","authors":"David A. Cieslak, N. Chawla","doi":"10.1109/ICDM.2008.87","DOIUrl":"https://doi.org/10.1109/ICDM.2008.87","url":null,"abstract":"Class imbalance is a ubiquitous problem in supervised learning and has gained wide-scale attention in the literature. Perhaps the most prevalent solution is to apply sampling to training data in order to improve classifier performance. The typical approach applies uniform levels of sampling globally. However, we believe that data is typically multi-modal, which suggests sampling should be treated locally rather than globally. The purpose of this paper is to propose a framework that first identifies meaningful regions of data and then finds optimal sampling levels within each. This paper demonstrates that a global classifier trained on locally sampled data produces superior rank-orderings on a wide range of real-world and artificial datasets as compared to contemporary global sampling methods.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"28 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120806899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variance Minimization Least Squares Support Vector Machines for Time Series Analysis","authors":"Róbert Ormándi","doi":"10.1109/ICDM.2008.79","DOIUrl":"https://doi.org/10.1109/ICDM.2008.79","url":null,"abstract":"Here we propose a novel machine learning method for time series forecasting based on the widely used Least Squares Support Vector Machine (LS-SVM) approach. The objective function of our method additionally contains a weighted variance minimization part. This modification makes the method more effective in time series forecasting, as this paper will show. The proposed method is a generalization of the well-known LS-SVM algorithm. It retains similar advantages, such as the applicability of the kernel trick, a linear and unique solution, and short computational time, but can perform better in certain scenarios. The main purpose of this paper is to introduce the novel Variance Minimization Least Squares Support Vector Machine (VMLS-SVM) method and to show its superiority through experimental results on standard benchmark time series prediction datasets.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128529522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Maximum Margin Logistic Regression for Credit Scoring","authors":"Sabyasachi Patra, K. Shanker, D. Kundu","doi":"10.1109/ICDM.2008.84","DOIUrl":"https://doi.org/10.1109/ICDM.2008.84","url":null,"abstract":"The objective of a credit scoring model is to categorize applicants as either accepted or rejected debtors prior to granting credit. A modified logistic loss function is proposed that can approximate the hinge loss; the resulting model, maximum margin logistic regression (MMLR), therefore has the classification capability of a support vector machine (SVM) at low computational cost. Finally, to classify credit applicants, an efficient algorithm based on epsilon-boosting is also described for MMLR, which provides sparse estimation of coefficients for better stability and interpretability.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128568562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantitative Association Analysis Using Tree Hierarchies","authors":"Feng Pan, Lynda Yang, L. McMillan, F. P. Villena, D. Threadgill, Wei Wang","doi":"10.1109/ICDM.2008.100","DOIUrl":"https://doi.org/10.1109/ICDM.2008.100","url":null,"abstract":"Association analysis arises in many important applications such as bioinformatics and business intelligence. Given a large collection of measurements over a set of samples, association analysis aims to find dependencies of target variables on subsets of measurements. Most previous algorithms adopt a two-stage approach: they first group samples based on similarity in the subset of measurements, and then examine the association between these groups and the specified target variables without considering inter-group similarities or alternative groupings. This can lead to cases where the strength of association depends significantly on arbitrary clustering choices. In this paper, we propose a tree-based method for quantitative association analysis. Tree hierarchies derived from sample similarities represent many possible sample groupings. They also provide a natural way to incorporate domain knowledge such as ontologies and to identify and remove outliers. Given a tree hierarchy, our association analysis evaluates all possible groupings and selects the one with the strongest association to the target variable. We introduce an efficient algorithm, TreeQA, to systematically explore the search space of all possible groupings in a set of input trees, with integrated permutation tests. Experimental results show that TreeQA is able to handle large-scale association analysis very efficiently and is more effective and robust than previous methods.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129739164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCS: A New Similarity Measure for Categorical Sequences","authors":"Abdellali Kelil, Shengrui Wang","doi":"10.1109/ICDM.2008.43","DOIUrl":"https://doi.org/10.1109/ICDM.2008.43","url":null,"abstract":"Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is to extract and make use of significant features hidden behind the chronological and structural dependencies found in these sequences. Almost all existing algorithms designed to perform this task are based on the matching of patterns in chronological order, but such sequences often have similar structural features in chronologically different positions. In this paper we propose SCS, a novel method for measuring the similarity between categorical sequences, based on an original pattern matching scheme that makes it possible to capture chronological and non-chronological dependencies. SCS captures significant patterns that represent the natural structure of sequences, and reduces the influence of those representing noise. It constitutes an effective approach for measuring the similarity of data such as biological sequences, natural language texts and financial transactions. To show its effectiveness, we have tested SCS extensively on a range of datasets, and compared the results with those obtained by various mainstream algorithms.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121472846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative Subgraph Mining for Principal Component Analysis","authors":"Hiroto Saigo, K. Tsuda","doi":"10.1109/ICDM.2008.62","DOIUrl":"https://doi.org/10.1109/ICDM.2008.62","url":null,"abstract":"Graph mining methods enumerate frequent subgraphs efficiently, but these are not necessarily good features for machine learning due to high correlation among them. Thus it makes sense to perform principal component analysis to reduce the dimensionality and create decorrelated features. We present a novel iterative mining algorithm that captures informative patterns corresponding to major entries of the top principal components. It repeatedly calls weighted substructure mining, where example weights are updated in each iteration. The Lanczos algorithm, a standard method for eigendecomposition, is employed to update the weights. In experiments, our patterns are shown to approximate the principal components obtained by frequent mining.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127639873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Framework for Syntax-Based Relation Mining","authors":"Bonaventura Coppola, Alessandro Moschitti, Daniele Pighin","doi":"10.1109/ICDM.2008.153","DOIUrl":"https://doi.org/10.1109/ICDM.2008.153","url":null,"abstract":"Supervised approaches to data mining are particularly appealing as they allow for the extraction of complex relations from data objects. In order to facilitate their application in different areas, ranging from protein-protein interaction in bioinformatics to text mining in computational linguistics, a modular and general mining framework is needed. The major constraint on the generalization process concerns the feature design for the description of relational data. In this paper, we present a machine learning framework for the automatic mining of relations, where the target objects are structurally organized in a tree. Object types are generalized by means of roles, whereas the relation properties are described by the underlying tree structure. The latter is encoded in the learning algorithm thanks to kernel methods for structured data, which represent structures in terms of all their possible subparts. This approach can be applied to any kind of data, regardless of its nature. Experiments with support vector machines on two text mining datasets for relation extraction, i.e. the PropBank and FrameNet corpora, show both that our approach is general and that it reaches state-of-the-art accuracy.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"289 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131890432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nonparametric Monotone Classification with MOCA","authors":"N. Barile, A. Feelders","doi":"10.1109/ICDM.2008.54","DOIUrl":"https://doi.org/10.1109/ICDM.2008.54","url":null,"abstract":"We describe a monotone classification algorithm called MOCA that attempts to minimize the mean absolute prediction error for classification problems with ordered class labels. We first find a monotone classifier with minimum L1 loss on the training sample, and then use a simple interpolation scheme to predict the class labels for attribute vectors not present in the training data. We compare MOCA to the ordinal stochastic dominance learner (OSDL) on artificial as well as real data sets. We show that MOCA often outperforms OSDL with respect to mean absolute prediction error.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126584466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dirichlet Process Based Evolutionary Clustering","authors":"Tianbing Xu, Zhongfei Zhang, Philip S. Yu, Bo Long","doi":"10.1109/ICDM.2008.23","DOIUrl":"https://doi.org/10.1109/ICDM.2008.23","url":null,"abstract":"Evolutionary clustering has emerged as an important research topic in the recent data mining literature, and solutions to this problem have found a wide spectrum of applications, particularly in social network analysis. In this paper, based on the recent literature on Dirichlet processes, we develop two different models as solutions to this problem: DPChain and HDP-EVO. Both models substantially advance the literature on evolutionary clustering in the sense that not only do they perform better than existing methods, but, more importantly, they are capable of automatically learning the number of clusters and the cluster structures during the evolution. Extensive evaluations demonstrate the effectiveness and promise of these models against the state of the art.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127633352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}