An efficient scheme for automatic web pages categorization using the support vector machine

IF 0.8; Zone 4, Computer Science; Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS; New Review of Hypermedia and Multimedia; Pub Date: 2016-07-01; DOI: 10.1080/13614568.2016.1152316

V. Bhalla, N. Kumar

Journal Article. New Review of Hypermedia and Multimedia, volume 22, pages 223-242.

Citations: 13

Abstract

In the past few years, with the evolution of the Internet and related technologies, the number of Internet users has grown exponentially. These users demand access to relevant web pages within a fraction of a second. Meeting this demand requires efficient categorization of web page content. Manually categorizing billions of web pages with high accuracy is a challenging task, and most of the existing techniques reported in the literature are semi-automatic and cannot achieve a high level of accuracy. To address these limitations, this paper proposes a scheme for automatically categorizing web pages into domain categories. The proposed scheme is based on identifying specific and relevant features of the web pages. Features are first extracted and evaluated, and the feature set is then filtered for the categorization of domain web pages. A feature extraction tool based on the HTML document object model (DOM) of the web page is developed as part of the scheme. Feature extraction and weight assignment are based on a domain-specific keyword list compiled from various domain pages. The keyword list is further reduced on the basis of the ids of the keywords in the list, and the keywords and tag text are stemmed to achieve higher accuracy. An extensive feature set is generated to develop a robust classification technique. The proposed scheme was evaluated using a machine learning method in combination with feature extraction and statistical analysis, with a support vector machine kernel as the classification tool. The results obtained confirm the effectiveness of the proposed scheme in terms of its accuracy across different categories of web pages.
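The abstract describes the feature extraction step only at a high level, so the following is a minimal sketch of the general idea and not the authors' tool. It assumes a hypothetical keyword list for a single "sports" domain, hypothetical per-tag weights (`TAG_WEIGHTS`), and a toy suffix-stripping `stem` function standing in for a real stemmer. It walks the HTML tag structure with Python's standard `html.parser`, stems both the keywords and the tag text, and produces one tag-weighted count per keyword.

```python
from html.parser import HTMLParser

# Assumption: illustrative weights, text in <title> counts more than body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5, "p": 1.0}

# Assumption: a tiny hand-made keyword list for a hypothetical "sports" domain.
RAW_KEYWORDS = ["match", "score", "team", "player"]


def stem(word):
    """Toy suffix stripper (an assumption), standing in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "ed", "e", "s"):
        # Strip at most one suffix, and only if at least 3 letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Keywords are stemmed once so they match the stemmed page tokens.
KEYWORDS = [stem(k) for k in RAW_KEYWORDS]


class KeywordFeatureParser(HTMLParser):
    """Accumulates a tag-weighted count for each stemmed domain keyword."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.features = {k: 0.0 for k in KEYWORDS}

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the weight of the innermost enclosing tag that has one.
        weight = 0.0
        for tag in reversed(self.open_tags):
            if tag in TAG_WEIGHTS:
                weight = TAG_WEIGHTS[tag]
                break
        if weight == 0.0:
            return  # text outside any weighted tag is ignored
        for token in data.lower().split():
            token = stem(token.strip(".,;:!?()\"'"))
            if token in self.features:
                self.features[token] += weight


def extract_features(html):
    """Return the feature vector for one page, ordered like KEYWORDS."""
    parser = KeywordFeatureParser()
    parser.feed(html)
    parser.close()
    return [parser.features[k] for k in KEYWORDS]
```

A vector produced this way could then be passed, together with vectors from labelled pages, to a support vector machine implementation such as scikit-learn's `SVC`. Which kernel, weights, and keyword lists the paper actually uses is not stated in the abstract.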


Source journal

New Review of Hypermedia and Multimedia (COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore

3.40

Self-citation rate

0.00%

Articles published

Review time

>12 weeks

About the journal: The New Review of Hypermedia and Multimedia (NRHM) is an interdisciplinary journal providing a focus for research covering practical and theoretical developments in hypermedia, hypertext, and interactive multimedia.