基于Spark的大规模多标签集成学习

2017 IEEE Trustcom/BigDataSE/ICESS Pub Date : 2017-08-01 DOI:10.1109/Trustcom/BigDataSE/ICESS.2017.328

Jorge Gonzalez-Lopez, Alberto Cano, S. Ventura

{"title":"基于Spark的大规模多标签集成学习","authors":"Jorge Gonzalez-Lopez, Alberto Cano, S. Ventura","doi":"10.1109/Trustcom/BigDataSE/ICESS.2017.328","DOIUrl":null,"url":null,"abstract":"Multi-label learning is a challenging problem which has received growing attention in the research community over the last years. Hence, there is a growing demand of effective and scalable multi-label learning methods for larger datasets both in terms of number of instances and numbers of output labels. The use of ensemble classifiers is a popular approach for improving multi-label model accuracy, especially for datasets with high-dimensional label spaces. However, the increasing computational complexity of the algorithms in such ever-growing high-dimensional label spaces, requires new approaches to manage data effectively and efficiently in distributed computing environments. Spark is a framework based on MapReduce, a distributed programming model that offers a robust paradigm to handle large-scale datasets in a cluster of nodes. This paper focuses on multi-label ensembles and proposes a number of implementations through the use of parallel and distributed computing using Spark. Additionally, five different implementations are proposed and the impact on the performance of the ensemble is analyzed. The experimental study shows the benefits of using distributed implementations over the traditional single-node single-thread execution, in terms of performance over multiple metrics as well as significant speedup tested on 29 benchmark datasets.","PeriodicalId":170253,"journal":{"name":"2017 IEEE Trustcom/BigDataSE/ICESS","volume":"82 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Large-Scale Multi-label Ensemble Learning on Spark\",\"authors\":\"Jorge Gonzalez-Lopez, Alberto Cano, S. Ventura\",\"doi\":\"10.1109/Trustcom/BigDataSE/ICESS.2017.328\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multi-label learning is a challenging problem which has received growing attention in the research community over the last years. Hence, there is a growing demand of effective and scalable multi-label learning methods for larger datasets both in terms of number of instances and numbers of output labels. The use of ensemble classifiers is a popular approach for improving multi-label model accuracy, especially for datasets with high-dimensional label spaces. However, the increasing computational complexity of the algorithms in such ever-growing high-dimensional label spaces, requires new approaches to manage data effectively and efficiently in distributed computing environments. Spark is a framework based on MapReduce, a distributed programming model that offers a robust paradigm to handle large-scale datasets in a cluster of nodes. This paper focuses on multi-label ensembles and proposes a number of implementations through the use of parallel and distributed computing using Spark. Additionally, five different implementations are proposed and the impact on the performance of the ensemble is analyzed. The experimental study shows the benefits of using distributed implementations over the traditional single-node single-thread execution, in terms of performance over multiple metrics as well as significant speedup tested on 29 benchmark datasets.\",\"PeriodicalId\":170253,\"journal\":{\"name\":\"2017 IEEE Trustcom/BigDataSE/ICESS\",\"volume\":\"82 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE Trustcom/BigDataSE/ICESS\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.328\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Trustcom/BigDataSE/ICESS","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.328","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 11

摘要

多标签学习是一个具有挑战性的问题，近年来在研究界受到越来越多的关注。因此，在实例数量和输出标签数量方面，对于更大数据集的有效和可扩展的多标签学习方法的需求不断增长。使用集成分类器是提高多标签模型精度的一种流行方法，特别是对于具有高维标签空间的数据集。然而，在这种不断增长的高维标签空间中，算法的计算复杂性不断增加，需要新的方法来有效地管理分布式计算环境中的数据。Spark是一个基于MapReduce的框架，MapReduce是一个分布式编程模型，提供了一个健壮的范例来处理节点集群中的大规模数据集。本文重点研究了多标签集成，并提出了一些使用Spark并行和分布式计算的实现方法。此外，还提出了五种不同的实现方法，并分析了它们对集成性能的影响。实验研究表明，就多个指标的性能以及在29个基准数据集上测试的显著加速而言，使用分布式实现优于传统的单节点单线程执行。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Large-Scale Multi-label Ensemble Learning on Spark

Multi-label learning is a challenging problem which has received growing attention in the research community over the last years. Hence, there is a growing demand of effective and scalable multi-label learning methods for larger datasets both in terms of number of instances and numbers of output labels. The use of ensemble classifiers is a popular approach for improving multi-label model accuracy, especially for datasets with high-dimensional label spaces. However, the increasing computational complexity of the algorithms in such ever-growing high-dimensional label spaces, requires new approaches to manage data effectively and efficiently in distributed computing environments. Spark is a framework based on MapReduce, a distributed programming model that offers a robust paradigm to handle large-scale datasets in a cluster of nodes. This paper focuses on multi-label ensembles and proposes a number of implementations through the use of parallel and distributed computing using Spark. Additionally, five different implementations are proposed and the impact on the performance of the ensemble is analyzed. The experimental study shows the benefits of using distributed implementations over the traditional single-node single-thread execution, in terms of performance over multiple metrics as well as significant speedup tested on 29 benchmark datasets.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2017 IEEE Trustcom/BigDataSE/ICESS

自引率

0.00%

发文量

期刊最新文献

Insider Threat Detection Through Attributed Graph Clustering SEEAD: A Semantic-Based Approach for Automatic Binary Code De-obfuscation A Public Key Encryption Scheme for String Identification Vehicle Incident Hot Spots Identification: An Approach for Big Data Implementing Chain of Custody Requirements in Database Audit Records for Forensic Purposes