Rotation Forest and Random Oracles: Two Classifier Ensemble Methods

Proceedings. IEEE International Symposium on Computer-Based Medical Systems. Pub Date: 2007-06-20. DOI: 10.1109/CBMS.2007.94

Juan José Rodríguez Diez


Citations: 11

Abstract



Classification methods are widely used in computer-based medical systems. Often, the accuracy of a classifier can be improved by using a classifier ensemble, a combination of several classifiers. Two classifier ensembles and their results on several medical data sets will be presented: Rotation Forest (Rodriguez, Kuncheva and Alonso) and Random Oracles (Kuncheva and Rodriguez).

Rotation Forest is a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and Principal Component Analysis (PCA) is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a base classifier. The idea of the rotation approach is to encourage individual accuracy and diversity within the ensemble simultaneously. Diversity is promoted through the feature extraction for each base classifier. Decision trees were chosen here because they are sensitive to rotation of the feature axes, hence the name "forest." Accuracy is sought by keeping all principal components and by using the whole data set to train each base classifier. Comparisons with various standard ensemble methods (Bagging, AdaBoost, and Random Forest) will be reported. Diversity-error diagrams reveal that Rotation Forest ensembles construct individual classifiers that are more accurate than those in AdaBoost and Random Forest, and more diverse than those in Bagging, sometimes more accurate as well.

A random oracle classifier is a mini-ensemble formed by a pair of classifiers and a fixed, randomly created oracle that selects between them. The random oracle can be thought of as a random discriminant function that splits the data into two subsets without regard to class labels or cluster structure.
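As a rough illustration of the random oracle idea, the sketch below pairs two base classifiers with a fixed random hyperplane that routes each sample to one of them. Several choices here are assumptions made for illustration and are not taken from the abstract: the hyperplane is the perpendicular bisector of two randomly chosen training points, the base learner is a hypothetical `NearestCentroid` stand-in, and the class name `RandomLinearOracle` is invented.

```python
import numpy as np


class NearestCentroid:
    """Tiny stand-in base classifier (an assumption; any classifier would do)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared distance from every sample to every class centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]


class RandomLinearOracle:
    """Pair of classifiers plus a fixed random hyperplane that picks one."""

    def __init__(self, make_base=NearestCentroid, seed=0):
        self.make_base = make_base
        self.rng = np.random.default_rng(seed)

    def _side(self, X):
        # 0 or 1 depending on which side of the hyperplane each sample lies.
        return (X @ self.w_ + self.b_ > 0).astype(int)

    def fit(self, X, y):
        # Assumed construction: the hyperplane is the perpendicular bisector
        # of two distinct random training points. The labels y are never
        # consulted when placing it.
        i, j = self.rng.choice(len(X), size=2, replace=False)
        self.w_ = X[i] - X[j]
        self.b_ = -self.w_ @ (X[i] + X[j]) / 2.0
        side = self._side(X)
        # One base classifier per half-space, each trained on its own subset.
        self.members_ = [
            self.make_base().fit(X[side == s], y[side == s]) for s in (0, 1)
        ]
        return self

    def predict(self, X):
        side = self._side(X)
        out = np.empty(len(X), dtype=self.members_[0].classes_.dtype)
        for s in (0, 1):
            mask = side == s
            if mask.any():
                # The oracle, fixed at training time, selects the member.
                out[mask] = self.members_[s].predict(X[mask])
        return out
```

Because the object exposes `fit` and `predict`, it could be dropped in wherever an ensemble method expects a base classifier, which is the usage the abstract describes. The sketch assumes the two sampled training points are distinct, so that both half-spaces receive at least one training sample.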
Two random oracles have been considered: linear and spherical. A random oracle classifier can be used as the base classifier of any ensemble method. It is argued that this approach encourages extra diversity in the ensemble while allowing for high accuracy of the individual ensemble members. Experiments with several data sets from UCI and 11 ensemble models will be reported. Each ensemble model will be examined with and without the oracle. The results will show that all ensemble methods benefited from the new approach, most markedly Random Subspace and Bagging. A further experiment with seven real medical data sets will demonstrate the validity of these findings outside the UCI data collection. When using Naive Bayes classifiers as base classifiers, the experiments show that ensembles based solely upon the spherical oracle (and no other ensemble heuristic) outrank Bagging, Wagging, Random Subspaces, AdaBoost.M1, MultiBoost and Decorate. Moreover, all these ensemble methods perform better with either of the two random oracles than in their standard versions without the oracles.
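The Rotation Forest feature-extraction step described earlier in the abstract (random split of the features into K subsets, PCA on each subset, all components kept) can be sketched as follows. This is a minimal illustration that follows only what the abstract states; the function name `rotation_matrix`, its parameters, and the use of SVD to obtain the principal axes are assumptions, and any further steps of the full method are omitted.

```python
import numpy as np


def rotation_matrix(X, n_subsets, rng):
    """Block rotation built from PCA on K random, disjoint feature subsets."""
    n_features = X.shape[1]
    # Randomly split the feature indices into K roughly equal subsets.
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    for idx in subsets:
        centered = X[:, idx] - X[:, idx].mean(axis=0)
        # Rows of vt are the principal axes of this subset. With
        # full_matrices=True all of them are returned, so no component
        # is discarded.
        _, _, vt = np.linalg.svd(centered, full_matrices=True)
        # Place the subset's axes in the matching rows and columns of R.
        R[np.ix_(idx, idx)] = vt.T
    return R


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 6))  # synthetic data, for illustration only
    # One independent rotation per ensemble member.
    rotations = [rotation_matrix(X, n_subsets=3, rng=rng) for _ in range(5)]
    X_rotated = X @ rotations[0]  # training features for the first member
    print(X_rotated.shape)
```

Each ensemble member would then train its own base classifier (a decision tree in the abstract) on `X @ R` for the whole data set and apply the same `R` to new samples before predicting. Since every block of `R` is orthogonal, the transform only rotates the axes; the diversity comes from each member drawing a different random feature split.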
