Bo Wu, Xueqi Cheng, Yu Wang, Gang Zhang, Guodong Ding
{"title":"通过页面分析促进包装器生成","authors":"Bo Wu, Xueqi Cheng, Yu Wang, Gang Zhang, Guodong Ding","doi":"10.1109/ISI.2009.5137299","DOIUrl":null,"url":null,"abstract":"Current approaches for generating wrappers for web page extraction suffer from the requirement of huge amount of labeled training pages to obtain satisfying results. On the other hand, the quality of data extracted by fully automatic methods is not reliable. In this paper, we propose a novel method to facilitate wrapper generation by combining wrapper induction and page analysis approaches. In addition to manually labeled data, we also take advantage of a set of unlabeled pages to improve the quality of induced wrappers. Our experiments demonstrate that our system achieves a satisfying result with fewer manually labeled training pages.","PeriodicalId":210911,"journal":{"name":"2009 IEEE International Conference on Intelligence and Security Informatics","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Facilitating wrapper generation with page analysis\",\"authors\":\"Bo Wu, Xueqi Cheng, Yu Wang, Gang Zhang, Guodong Ding\",\"doi\":\"10.1109/ISI.2009.5137299\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Current approaches for generating wrappers for web page extraction suffer from the requirement of huge amount of labeled training pages to obtain satisfying results. On the other hand, the quality of data extracted by fully automatic methods is not reliable. In this paper, we propose a novel method to facilitate wrapper generation by combining wrapper induction and page analysis approaches. In addition to manually labeled data, we also take advantage of a set of unlabeled pages to improve the quality of induced wrappers. Our experiments demonstrate that our system achieves a satisfying result with fewer manually labeled training pages.\",\"PeriodicalId\":210911,\"journal\":{\"name\":\"2009 IEEE International Conference on Intelligence and Security Informatics\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-06-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 IEEE International Conference on Intelligence and Security Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISI.2009.5137299\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 IEEE International Conference on Intelligence and Security Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISI.2009.5137299","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Facilitating wrapper generation with page analysis
Current approaches for generating wrappers for web page extraction suffer from the requirement of huge amount of labeled training pages to obtain satisfying results. On the other hand, the quality of data extracted by fully automatic methods is not reliable. In this paper, we propose a novel method to facilitate wrapper generation by combining wrapper induction and page analysis approaches. In addition to manually labeled data, we also take advantage of a set of unlabeled pages to improve the quality of induced wrappers. Our experiments demonstrate that our system achieves a satisfying result with fewer manually labeled training pages.