An Efficient and Scalable MetaFeature-based Document Classification Approach based on Massively Parallel Computing

Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Pub Date: 2015-08-09
DOI: 10.1145/2766462.2767743

Sérgio D. Canuto, Marcos André Gonçalves, W. M. D. Santos, Thierson Couto, W. Martins


Citations: 10

Abstract

The unprecedented growth of available data has stimulated the development of new methods for organizing this data and extracting useful knowledge from it. Automatic Document Classification (ADC) is one such method: it uses machine learning techniques to build models capable of automatically associating documents with well-defined semantic classes. ADC is the basis of many important applications, such as language identification, sentiment analysis, recommender systems, and spam filtering. Recently, meta-features have been shown to substantially improve the effectiveness of ADC algorithms. In particular, meta-features that combine local information (through kNN-based features) with global information (through category centroids) have produced promising results. However, generating these meta-features is very costly in terms of both memory consumption and runtime, since the kNN algorithm must be called constantly. We take advantage of current manycore GPU architectures and present a massively parallel version of the kNN algorithm for high-dimensional, sparse datasets (which is the case for ADC). Our experimental results show speedups of up to 15x and a reduction in memory consumption of more than 5000x compared to a state-of-the-art parallel baseline. This opens up the possibility of applying meta-feature-based classification to large collections of documents, which would otherwise take too much time or require an expensive computational platform.
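To make the idea of combining local (kNN-based) and global (centroid-based) information concrete, the following is a minimal CPU sketch in plain Python. It is an illustration under assumptions, not the authors' method: the abstract does not define the exact meta-features, so the layout used here (for each class, the k highest cosine similarities to training documents of that class, followed by the similarity to the class centroid) and all function names are hypothetical. Sparse documents are represented as dictionaries mapping terms to weights.

```python
import heapq
import math
from collections import defaultdict


def normalize(vec):
    """Return an L2-normalised copy of a sparse vector (dict: term -> weight)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors.

    Only the non-zero entries of the smaller vector are visited, which is
    what makes sparse, high-dimensional data tractable.
    """
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def class_centroids(train, labels):
    """Sum the training vectors of each class and normalise the result."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(train, labels):
        for t, w in vec.items():
            sums[label][t] += w
    return {label: normalize(s) for label, s in sums.items()}


def meta_features(query, train, labels, centroids, k):
    """Build a dense meta-feature vector for one query document.

    Assumed layout (hypothetical): for every class, in sorted order, the k
    largest similarities to training documents of that class (zero-padded
    when the class has fewer than k documents), then the similarity to the
    class centroid. The result has len(centroids) * (k + 1) entries.
    """
    features = []
    for c in sorted(centroids):
        # Local information: brute-force kNN restricted to class c.
        sims = (cosine(query, v) for v, label in zip(train, labels) if label == c)
        local = heapq.nlargest(k, sims)
        local += [0.0] * (k - len(local))
        features.extend(local)
        # Global information: similarity to the class centroid.
        features.append(cosine(query, centroids[c]))
    return features


# Tiny usage example with made-up documents.
train = [normalize({"gpu": 1.0, "cuda": 1.0}),
         normalize({"gpu": 1.0}),
         normalize({"spam": 1.0})]
labels = ["hw", "hw", "mail"]
centroids = class_centroids(train, labels)
features = meta_features(normalize({"gpu": 1.0}), train, labels, centroids, k=2)
print(len(features))
```

The similarity scan inside `meta_features` touches every training document for every query, which illustrates why repeated kNN calls dominate the cost the abstract describes; that scan is the kind of step a massively parallel implementation would target.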



