Artificial bee colony algorithm for feature selection and improved support vector machine for text classification

Information Discovery and Delivery | IF 2.6 | Q2, Information Science & Library Science | Pub Date: 2019-08-19 | DOI: 10.1108/IDD-09-2018-0045

J. Balakumar, S. Mohan


Citations: 14

Abstract

Purpose: Owing to the huge volume of documents available on the internet, text classification has become a necessary task for handling them. To achieve good classification results, feature selection is used as an important stage that reduces the dimensionality of text documents by choosing suitable features. The main purpose of this research is to classify documents stored on a personal computer based on their content.

Design/methodology/approach: This paper proposes a new feature selection algorithm based on the artificial bee colony (ABCFS) to improve text classification accuracy. ABCFS is evaluated on real and benchmark data sets in comparison with existing feature selection approaches such as information gain and the χ² statistic. To assess the efficiency of the proposed algorithm, a support vector machine (SVM) and an improved SVM classifier are used.

Findings: The experiment was conducted on real and benchmark data sets. The real data set consisted of documents stored on a personal computer, and the benchmark data sets were taken from the Reuters and 20 Newsgroups corpora. The results show that the proposed feature selection algorithm improves text document classification accuracy.

Originality/value: This paper proposes a new ABCFS algorithm for feature selection, evaluates its efficiency and improves the support vector machine. Here, ABCFS is used to select features from text (unstructured) documents. According to the authors, existing work uses artificial bee colony feature selection for structured data features but not for text features. The proposed algorithm classifies documents automatically based on their content.
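The abstract does not describe how ABCFS works internally. As a rough illustration only, the Python sketch below shows how a generic binary artificial bee colony search can pick a feature subset. Everything in it is an assumption rather than the authors' method: the function name `abc_feature_selection`, its parameters (`n_sources`, `limit`, `n_cycles`), the one-bit-flip neighbourhood and the toy fitness function. In a setup like the paper's, the fitness would plausibly be the accuracy of an SVM trained on the selected features.

```python
import random
from typing import Callable, List, Tuple

Mask = List[int]  # 1 = feature kept, 0 = feature dropped


def abc_feature_selection(
    n_features: int,
    fitness: Callable[[Mask], float],
    n_sources: int = 10,   # number of food sources (candidate subsets)
    limit: int = 5,        # failed tries tolerated before a source is abandoned
    n_cycles: int = 50,
    seed: int = 0,
) -> Tuple[Mask, float]:
    """Generic binary ABC search over feature masks (illustrative sketch)."""
    rng = random.Random(seed)

    def random_mask() -> Mask:
        return [rng.randint(0, 1) for _ in range(n_features)]

    sources = [random_mask() for _ in range(n_sources)]
    scores = [fitness(m) for m in sources]
    trials = [0] * n_sources
    best_i = max(range(n_sources), key=scores.__getitem__)
    best_mask, best_score = sources[best_i][:], scores[best_i]

    def try_improve(i: int) -> None:
        """Flip one random bit of source i; keep the move only if it scores higher."""
        nonlocal best_mask, best_score
        cand = sources[i][:]
        j = rng.randrange(n_features)
        cand[j] = 1 - cand[j]
        score = fitness(cand)
        if score > scores[i]:
            sources[i], scores[i], trials[i] = cand, score, 0
            if score > best_score:
                best_mask, best_score = cand[:], score
        else:
            trials[i] += 1

    for _ in range(n_cycles):
        # Employed bees: one local move per food source.
        for i in range(n_sources):
            try_improve(i)
        # Onlooker bees: revisit sources with probability proportional to fitness.
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        for i in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_improve(i)
        # Scout bees: replace sources that have stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = random_mask()
                scores[i] = fitness(sources[i])
                trials[i] = 0
                if scores[i] > best_score:
                    best_mask, best_score = sources[i][:], scores[i]
    return best_mask, best_score


# Hypothetical toy fitness: reward three "informative" feature indices and
# penalise subset size. A real fitness would be classifier accuracy.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask: Mask) -> float:
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


if __name__ == "__main__":
    mask, score = abc_feature_selection(n_features=12, fitness=toy_fitness)
    print("selected features:", [i for i, bit in enumerate(mask) if bit])
    print("fitness:", score)
```

The search keeps the best mask seen so far, so abandoning a stalled food source in the scout phase never loses the best result. Swapping `toy_fitness` for a cross-validated classifier score would turn this into a wrapper-style feature selector, at the cost of one model fit per candidate subset.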




Source journal

Information Discovery and Delivery (Information Science & Library Science)

CiteScore: 5.40

Self-citation rate: 4.80%

Journal description: Information Discovery and Delivery covers information discovery and access for digital information researchers. This includes educators, knowledge professionals in education and cultural organisations, knowledge managers in media, health care and government, as well as librarians. The journal publishes research and practice that explores the digital information supply chain, i.e. transport, flows, tracking, exchange and sharing, including within and between libraries. It is also interested in digital information capture, packaging and storage by "collectors" of all kinds. Information is widely defined, including but not limited to: records, documents, learning objects, visual and sound files, data and metadata, and user-generated content.