Web Content Extraction based on Webpage Layout Analysis
Authors: Lei Fu, Yao Meng, Yingju Xia, Hao Yu
Published in: 2010 Second International Conference on Information Technology and Computer Science
Publication date: 2010-07-24
DOI: 10.1109/ITCS.2010.16 (https://doi.org/10.1109/ITCS.2010.16)
Citations: 28
Abstract
Researchers have proposed many methods for web content extraction, including wrapper-based, DOM tree rule-based, and machine learning-based methods. To some extent, all of these methods ignore the layout information of the webpage, even though layout information such as spatial and visual cues often plays a very important role in locating the main content of a webpage when browsing. As a consequence, these methods often discard part of the main content during extraction. In this paper, we present a method that combines webpage layout analysis with the DOM tree rule-based method and makes full use of the advantages of both. It uses layout information to guide extraction with a global view of the page and can achieve better performance than the traditional methods.
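
The abstract does not give the algorithm itself, so the following is only an illustrative sketch of the general idea of combining a DOM-level rule with layout cues. It is not the authors' method. The `Block` structure, the link-density rule, the area-times-centrality layout score, and the `max_link_density` threshold are all assumptions made for this example; in practice the bounding boxes would come from a rendering engine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A rendered DOM block (hypothetical structure, not from the paper)."""
    tag: str
    text_len: int        # characters of visible text in the block
    link_text_len: int   # characters of that text that sit inside links
    x: float             # left edge of the rendered box, in pixels
    y: float             # top edge of the rendered box, in pixels
    width: float
    height: float


def link_density(block: Block) -> float:
    """DOM-level rule (assumed): share of the block's text that is link text."""
    if block.text_len == 0:
        return 1.0  # treat empty blocks as pure navigation
    return block.link_text_len / block.text_len


def layout_score(block: Block, page_w: float, page_h: float) -> float:
    """Layout cue (assumed): page-area fraction weighted by horizontal centrality."""
    area_fraction = (block.width * block.height) / (page_w * page_h)
    center_x = block.x + block.width / 2
    centrality = 1.0 - abs(center_x - page_w / 2) / (page_w / 2)
    return area_fraction * max(centrality, 0.0)


def extract_main(blocks: List[Block], page_w: float, page_h: float,
                 max_link_density: float = 0.5) -> Optional[Block]:
    """Drop link-heavy blocks with the DOM rule, then rank the rest by layout."""
    candidates = [
        b for b in blocks
        if b.text_len > 0 and link_density(b) <= max_link_density
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda b: layout_score(b, page_w, page_h))


if __name__ == "__main__":
    # A made-up 1000x900 page: navigation bar, sidebar, article body, footer.
    page = [
        Block("nav", 100, 95, 0, 0, 1000, 50),
        Block("sidebar", 300, 250, 0, 50, 200, 800),
        Block("article", 3000, 100, 200, 50, 600, 800),
        Block("footer", 80, 10, 0, 850, 1000, 50),
    ]
    main = extract_main(page, 1000, 900)
    print(main.tag if main else "no main content found")
```

In this sketch the DOM rule only filters candidates, and the layout score makes the final choice from a page-wide view, which loosely mirrors the abstract's claim that layout information guides extraction globally. A real system would likely use richer visual cues than box area and horizontal position.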