{"title":"A Joint Matrix Factorization Approach to Unsupervised Action Categorization","authors":"Peng Cui, Fei Wang, Lifeng Sun, Shiqiang Yang","doi":"10.1109/ICDM.2008.59","DOIUrl":"https://doi.org/10.1109/ICDM.2008.59","url":null,"abstract":"In this paper, a novel unsupervised approach to mining categories from action video sequences is presented. This approach consists of two modules: action representation and a learning model. Videos are regarded as spatially distributed dynamic pixel time series, which are quantized into pixel prototypes. After replacing the pixel time series with their corresponding prototype labels, the video sequences are compressed into 2D action matrices. We put these matrices together to form a multi-action tensor, and propose a joint matrix factorization method to simultaneously cluster the pixel prototypes into pixel signatures and the matrices into action classes. The approach is tested on the public and popular Weizmann dataset, and promising results are achieved.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"8 4-5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115322862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Start Globally, Optimize Locally, Predict Globally: Improving Performance on Imbalanced Data","authors":"David A. Cieslak, N. Chawla","doi":"10.1109/ICDM.2008.87","DOIUrl":"https://doi.org/10.1109/ICDM.2008.87","url":null,"abstract":"Class imbalance is a ubiquitous problem in supervised learning and has gained wide-scale attention in the literature. Perhaps the most prevalent solution is to apply sampling to training data in order to improve classifier performance. The typical approach applies uniform levels of sampling globally. However, we believe that data is typically multi-modal, which suggests sampling should be treated locally rather than globally. The purpose of this paper is to propose a framework that first identifies meaningful regions of data and then finds optimal sampling levels within each. This paper demonstrates that a global classifier trained on locally sampled data produces superior rank-orderings on a wide range of real-world and artificial datasets as compared to contemporary global sampling methods.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"28 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120806899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variance Minimization Least Squares Support Vector Machines for Time Series Analysis","authors":"Róbert Ormándi","doi":"10.1109/ICDM.2008.79","DOIUrl":"https://doi.org/10.1109/ICDM.2008.79","url":null,"abstract":"Here we propose a novel machine learning method for time series forecasting based on the widely used Least Squares Support Vector Machine (LS-SVM) approach. The objective function of our method additionally contains a weighted variance minimization part. This modification makes the method more effective in time series forecasting, as this paper will show. The proposed method is a generalization of the well-known LS-SVM algorithm. It retains similar advantages, such as the applicability of the kernel trick, a linear and unique solution, and short computational time, but can perform better in certain scenarios. The main purpose of this paper is to introduce the novel Variance Minimization Least Squares Support Vector Machine (VMLS-SVM) method and to show its superiority through experimental results on standard benchmark time series prediction datasets.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128529522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Maximum Margin Logistic Regression for Credit Scoring","authors":"Sabyasachi Patra, K. Shanker, D. Kundu","doi":"10.1109/ICDM.2008.84","DOIUrl":"https://doi.org/10.1109/ICDM.2008.84","url":null,"abstract":"The objective of a credit scoring model is to categorize applicants as either accepted or rejected debtors prior to granting credit. A modified logistic loss function is proposed that can approximate the hinge loss; the resulting model, maximum margin logistic regression (MMLR), therefore has the classification capability of a support vector machine (SVM) at low computational cost. Finally, to classify credit applicants, an efficient algorithm based on epsilon-boosting is also described for MMLR, which provides sparse estimation of coefficients for better stability and interpretability.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128568562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantitative Association Analysis Using Tree Hierarchies","authors":"Feng Pan, Lynda Yang, L. McMillan, F. P. Villena, D. Threadgill, Wei Wang","doi":"10.1109/ICDM.2008.100","DOIUrl":"https://doi.org/10.1109/ICDM.2008.100","url":null,"abstract":"Association analysis arises in many important applications such as bioinformatics and business intelligence. Given a large collection of measurements over a set of samples, association analysis aims to find dependencies of target variables on subsets of measurements. Most previous algorithms adopt a two-stage approach: they first group samples based on similarity in the subset of measurements, and then examine the association between these groups and the specified target variables without considering inter-group similarities or alternative groupings. This can lead to cases where the strength of association depends significantly on arbitrary clustering choices. In this paper, we propose a tree-based method for quantitative association analysis. Tree hierarchies derived from sample similarities represent many possible sample groupings. They also provide a natural way to incorporate domain knowledge such as ontologies and to identify and remove outliers. Given a tree hierarchy, our association analysis evaluates all possible groupings and selects the one with the strongest association to the target variable. We introduce an efficient algorithm, TreeQA, to systematically explore the search space of all possible groupings in a set of input trees, with integrated permutation tests. Experimental results show that TreeQA is able to handle large-scale association analysis very efficiently and is more effective and robust than previous methods.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129739164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCS: A New Similarity Measure for Categorical Sequences","authors":"Abdellali Kelil, Shengrui Wang","doi":"10.1109/ICDM.2008.43","DOIUrl":"https://doi.org/10.1109/ICDM.2008.43","url":null,"abstract":"Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is to extract and make use of significant features hidden behind the chronological and structural dependencies found in these sequences. Almost all existing algorithms designed to perform this task are based on the matching of patterns in chronological order, but such sequences often have similar structural features in chronologically different positions. In this paper we propose SCS, a novel method for measuring the similarity between categorical sequences, based on an original pattern matching scheme that makes it possible to capture chronological and non-chronological dependencies. SCS captures significant patterns that represent the natural structure of sequences, and reduces the influence of those representing noise. It constitutes an effective approach for measuring the similarity of data such as biological sequences, natural language texts and financial transactions. To show its effectiveness, we have tested SCS extensively on a range of datasets, and compared the results with those obtained by various mainstream algorithms.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121472846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative Subgraph Mining for Principal Component Analysis","authors":"Hiroto Saigo, K. Tsuda","doi":"10.1109/ICDM.2008.62","DOIUrl":"https://doi.org/10.1109/ICDM.2008.62","url":null,"abstract":"Graph mining methods enumerate frequent subgraphs efficiently, but these are not necessarily good features for machine learning due to high correlation among them. Thus it makes sense to perform principal component analysis to reduce the dimensionality and create decorrelated features. We present a novel iterative mining algorithm that captures informative patterns corresponding to major entries of the top principal components. It repeatedly calls weighted substructure mining, where example weights are updated in each iteration. The Lanczos algorithm, a standard method for eigendecomposition, is employed to update the weights. In experiments, our patterns are shown to approximate the principal components obtained by frequent mining.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127639873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Framework for Syntax-Based Relation Mining","authors":"Bonaventura Coppola, Alessandro Moschitti, Daniele Pighin","doi":"10.1109/ICDM.2008.153","DOIUrl":"https://doi.org/10.1109/ICDM.2008.153","url":null,"abstract":"Supervised approaches to data mining are particularly appealing as they allow for the extraction of complex relations from data objects. In order to facilitate their application in different areas, ranging from protein-protein interaction in bioinformatics to text mining in computational linguistics, a modular and general mining framework is needed. The major constraint on the generalization process concerns the feature design for the description of relational data. In this paper, we present a machine learning framework for the automatic mining of relations, where the target objects are structurally organized in a tree. Object types are generalized by means of roles, whereas the relation properties are described by the underlying tree structure. The latter is encoded in the learning algorithm thanks to kernel methods for structured data, which represent structures in terms of all their possible subparts. This approach can be applied to any kind of data, regardless of its nature. Experiments with support vector machines on two text mining datasets for relation extraction, i.e. the PropBank and FrameNet corpora, show both that our approach is general and that it reaches state-of-the-art accuracy.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"289 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131890432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nonparametric Monotone Classification with MOCA","authors":"N. Barile, A. Feelders","doi":"10.1109/ICDM.2008.54","DOIUrl":"https://doi.org/10.1109/ICDM.2008.54","url":null,"abstract":"We describe a monotone classification algorithm called MOCA that attempts to minimize the mean absolute prediction error for classification problems with ordered class labels. We first find a monotone classifier with minimum L1 loss on the training sample, and then use a simple interpolation scheme to predict the class labels for attribute vectors not present in the training data. We compare MOCA to the ordinal stochastic dominance learner (OSDL) on artificial as well as real data sets. We show that MOCA often outperforms OSDL with respect to mean absolute prediction error.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126584466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dirichlet Process Based Evolutionary Clustering","authors":"Tianbing Xu, Zhongfei Zhang, Philip S. Yu, Bo Long","doi":"10.1109/ICDM.2008.23","DOIUrl":"https://doi.org/10.1109/ICDM.2008.23","url":null,"abstract":"Evolutionary clustering has emerged as an important research topic in the recent data mining literature, and solutions to this problem have found a wide spectrum of applications, particularly in social network analysis. In this paper, based on the recent literature on Dirichlet processes, we develop two different models as solutions to this problem: DPChain and HDP-EVO. Both models substantially advance the literature on evolutionary clustering in the sense that not only do they perform better than existing methods, but, more importantly, they are capable of automatically learning the number of clusters and the cluster structures during the evolution. Extensive evaluations demonstrate the effectiveness and promise of these models against the state of the art.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127633352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}