Implicit Links based Web Page Representation for Web Page Classification
Abdelbadie Belmouhcine, M. Benkhalifa
Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, 2015-07-13
DOI: 10.1145/2797115.2797125 (https://doi.org/10.1145/2797115.2797125)
Citations: 6
Abstract
As the web grows rapidly, web page classification becomes increasingly important. Both the way a web page is represented and the contextual features used in that representation affect classification performance, so finding an adequate representation is essential for better web page classification. In this paper, we propose a web page representation based on the structure of the implicit graph built from implicit links extracted from the query log. In this representation, we represent web pages using their textual content together with their neighbors themselves as features, instead of using the features of their neighbors. The intuition is that when two or more web pages in the implicit graph share the same direct neighbors and belong to the same class c_i, any other web page with the same direct neighbors most likely belongs to class c_i as well. We propose two kinds of web page representation: Boolean Neighbor Vector (BNV) and Weighted Neighbor Vector (WNV). In BNV, we supplement the feature vector representing the textual content of a web page with a Boolean vector that indicates, for each web page, whether it is a direct neighbor of the target web page. In WNV, we supplement the same textual feature vector with a weighted vector that gives the strength of the relation between the target web page and each of its neighbors. We conduct experiments using four classifiers, SVM (Support Vector Machine), NB (Naive Bayes), RF (Random Forest) and KNN (K-Nearest Neighbors), on two subsets of ODP (Open Directory Project). Results show that (1) the proposed representation helps obtain better classification results with SVM, NB, RF and KNN for both Bag of Words (BW) and 5-gram representations, and (2) performance based on BNV is better than performance based on WNV.
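As a rough illustration of the two representations, the following Python sketch appends a neighbor vector to a page's textual feature vector. It is not the authors' implementation: the function names, the toy graph, and the edge weights (imagined here as co-click counts from a query log) are assumptions made for the example, since the abstract does not say how link strengths are computed.

```python
from typing import Dict, List

# Hypothetical implicit graph: weights[a][b] is the strength of the implicit
# link between pages a and b (assumed here to be a co-click count).
Graph = Dict[str, Dict[str, float]]


def neighbor_vectors(page: str, pages: List[str], weights: Graph):
    """Return (boolean, weighted) neighbor vectors for `page`.

    Position i of each vector corresponds to pages[i].
    """
    row = weights.get(page, {})
    weighted = [float(row.get(other, 0.0)) for other in pages]
    boolean = [1.0 if w > 0 else 0.0 for w in weighted]
    return boolean, weighted


def represent(text_features: List[float], page: str, pages: List[str],
              weights: Graph, use_weights: bool = False) -> List[float]:
    """Concatenate textual features with a BNV (default) or WNV vector."""
    boolean, weighted = neighbor_vectors(page, pages, weights)
    return list(text_features) + (weighted if use_weights else boolean)


# Toy data (all values are made up for illustration).
pages = ["p1", "p2", "p3", "p4"]
weights = {
    "p1": {"p2": 3, "p3": 1},
    "p2": {"p1": 3},
    "p3": {"p1": 1},
    "p4": {},
}
text_p1 = [0.5, 0.0, 1.0]  # e.g. bag-of-words weights for three terms

print(represent(text_p1, "p1", pages, weights))
# [0.5, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
print(represent(text_p1, "p1", pages, weights, use_weights=True))
# [0.5, 0.0, 1.0, 0.0, 3.0, 1.0, 0.0]
```

The resulting vectors could then be passed to any standard classifier. With many pages the neighbor part becomes long and mostly zero, so a sparse matrix format would likely be preferable in practice.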