{"title":"Leveraging Web 2.0 Sources for Web Content Classification","authors":"Somnath Banerjee, Martin Scholz","doi":"10.1109/WIIAT.2008.291","DOIUrl":null,"url":null,"abstract":"This paper addresses practical aspects of Web page classification not captured by the classical text mining framework. Classifiers are supposed to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care has to be taken if the goal is to generalize to previously unseen kinds of pages on the Web. We study techniques for building training corpora automatically from publicly available Web resources, quantify the discrepancy between them, and demonstrate that encouraging agreement between classifiers given such diverse sources drastically outperforms methods that ignore the different natures of data sources on the Web.","PeriodicalId":393772,"journal":{"name":"2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2008-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WIIAT.2008.291","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
This paper addresses practical aspects of Web page classification not captured by the classical text mining framework. Classifiers are supposed to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care has to be taken if the goal is to generalize to previously unseen kinds of pages on the Web. We study techniques for building training corpora automatically from publicly available Web resources, quantify the discrepancy between them, and demonstrate that encouraging agreement between classifiers given such diverse sources drastically outperforms methods that ignore the different natures of data sources on the Web.