Automatic Web News Content Extraction Based on Similar Pages
Chunyuan Zhang, Z. Lin
2010 International Conference on Web Information Systems and Mining, 2010-10-23
DOI: 10.1109/WISM.2010.154 (https://doi.org/10.1109/WISM.2010.154)
Citations: 4
Abstract
Most news pages today are generated from an underlying structured source, so we argue that template-dependent wrappers suit them better than template-independent ones. In this paper, we propose a novel automatic template-dependent Web news content extraction approach based on similar pages. First, we choose two similar pages as training samples and represent them as HTML DOM trees. Second, we build the maximum matching tree between the two DOM trees using our simple tree matching and backtracking algorithm. Third, by analyzing the characteristics of the nodes in the maximum matching tree, we eliminate noise nodes to generate an extraction template. Finally, we build a template-dependent wrapper for target news pages whose structure is similar to that of the samples. Experimental results indicate that our approach is effective and efficient for Web news content extraction: the average harmonic mean of precision and recall (F1 score) reaches 98.3%.
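
The abstract does not spell out the matching step, so the following is only a minimal sketch of how a maximum matching between two DOM trees could be computed. It assumes the classic Simple Tree Matching dynamic program with a backtracking pass to recover the matched node pairs, which may differ from the authors' variant. The `Node` class, the `simple_tree_matching` function, and the two toy trees are hypothetical names introduced for illustration; nodes are compared by tag name only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag name plus ordered children."""
    tag: str
    children: list[Node] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> tuple[int, list[tuple[Node, Node]]]:
    """Return (number of matched nodes, matched node pairs) for trees a and b.

    Assumption: two nodes can match only if their tags are equal, and the
    left-to-right order of children must be preserved.
    """
    if a.tag != b.tag:
        return 0, []

    m, n = len(a.children), len(b.children)
    # sub[i][j] holds the result of matching a.children[i] against b.children[j].
    sub = [[simple_tree_matching(ca, cb) for cb in b.children] for ca in a.children]
    # score[i][j]: best total using the first i children of a and first j of b.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j],
                score[i][j - 1],
                score[i - 1][j - 1] + sub[i - 1][j - 1][0],
            )

    # Backtrack through the table to recover which child subtrees were paired.
    pairs = [(a, b)]
    i, j = m, n
    while i > 0 and j > 0:
        weight, child_pairs = sub[i - 1][j - 1]
        if weight > 0 and score[i][j] == score[i - 1][j - 1] + weight:
            pairs.extend(child_pairs)
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1

    return score[m][n] + 1, pairs


# Two toy "similar pages": same skeleton, different trailing block.
page_a = Node("body", [Node("h1"), Node("div", [Node("p"), Node("p")]), Node("ul")])
page_b = Node("body", [Node("h1"), Node("div", [Node("p")]), Node("table")])

size, matched = simple_tree_matching(page_a, page_b)
print(size, sorted(x.tag for x, _ in matched))
```

In this toy input the two trees share four nodes (`body`, `h1`, `div`, and one `p`). The recovered pairs play the role of the maximum matching tree from which noise nodes would then be pruned. How noise nodes are identified is not described in the abstract and is left out of the sketch.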