Authors: Xin Yu, Z. Jin
Published in: 2017 IEEE 17th International Conference on Communication Technology (ICCT), 2017-10-01
DOI: https://doi.org/10.1109/ICCT.2017.8359846
Web content information extraction based on DOM tree and statistical information
Web pages are growing rapidly in number and carry a great deal of information, yet a typical page contains little main content and much unrelated noise, such as script code, links, and advertising. This noise occupies a lot of space, which makes pages poorly suited to small mobile devices, data mining, and information retrieval. Web information extraction technology is therefore becoming increasingly important. However, most extraction methods cannot adapt to varied and heterogeneous web structures, and they suffer from poor generality and low extraction efficiency. In this paper, we propose a method that adapts to the heterogeneity and variability of web pages and achieves high precision and recall. Our method uses the DOM structure to divide a web page into several blocks and then extracts the content blocks using statistical information, rather than relying on machine learning with its repeated training and manual labeling. It achieves good performance in precision, recall, and F1.
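The abstract does not say which statistics the method uses or how it delimits blocks, so the following is only a minimal sketch of the general idea (split the page into DOM blocks, then keep blocks by simple statistics). It assumes text length and link density as the statistics. The names `BlockSplitter`, `extract_content`, `min_chars`, and `max_link_ratio`, the tag sets, and the threshold values are all hypothetical and not taken from the paper.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate blocks; the paper may use other rules.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}
# Script and style content is treated as noise and never counted as text.
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Collects the text and the link-text length of each block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks: {"text": str, "link_chars": int}
        self._stack = []   # currently open block elements (innermost last)
        self._skip = 0     # nesting depth inside script/style
        self._link = 0     # nesting depth inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._link += 1
        elif tag in BLOCK_TAGS:
            self._stack.append({"text": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._link:
            self._link -= 1
        elif tag in BLOCK_TAGS and self._stack:
            # Simplification: assumes well-nested markup, so the innermost
            # open block is the one being closed.
            block = self._stack.pop()
            text = " ".join(block["text"])
            if text:
                self.blocks.append({"text": text, "link_chars": block["link_chars"]})

    def handle_data(self, data):
        if self._skip or not self._stack:
            return
        text = data.strip()
        if not text:
            return
        block = self._stack[-1]  # text belongs to the innermost open block
        block["text"].append(text)
        if self._link:
            block["link_chars"] += len(text)


def extract_content(html, min_chars=40, max_link_ratio=0.3):
    """Return the text of blocks that are long enough and not dominated by links."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        chars = len(block["text"])
        link_ratio = block["link_chars"] / chars
        if chars >= min_chars and link_ratio <= max_link_ratio:
            kept.append(block["text"])
    return kept


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<script>var tracker = "this script text should never be counted as content";</script>
<div id="main"><p>The DOM tree divides a page into blocks, and simple statistics such as
text length and link density separate the main content from navigation.</p></div>
<div id="ad"><a href="/buy">Buy now and save big on everything today only!!</a></div>
</body></html>
"""

if __name__ == "__main__":
    for text in extract_content(SAMPLE_HTML):
        print(text)
```

In this sketch the navigation bar is rejected for being too short, the advertisement for consisting entirely of link text, and the script for sitting inside a skipped tag. Fixed thresholds are used only to keep the example small; they need no training or manual labeling, but they are not the statistics reported in the paper.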