Web content information extraction based on DOM tree and statistical information

Xin Yu, Z. Jin
DOI: 10.1109/ICCT.2017.8359846
Published in: 2017 IEEE 17th International Conference on Communication Technology (ICCT)
Publication date: 2017-10-01
URL: https://doi.org/10.1109/ICCT.2017.8359846
Citations: 6

Abstract

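Web pages are growing rapidly in number and carry a great deal of information, yet much of each page is not main content but unrelated noise such as script code, links, and advertising. This noise takes up a lot of space, which makes pages poorly suited to small mobile devices, data mining, and information retrieval. Web information extraction is therefore becoming increasingly important. However, most existing extraction methods cannot adapt to varied and heterogeneous web structures, and they suffer from poor generality and low extraction efficiency. In this paper, we propose a method that adapts to the heterogeneity and variability of web pages and achieves high precision and recall. The method uses the DOM structure to divide a web page into several blocks and then extracts the content blocks using statistical information, rather than relying on repeated machine learning training and manual labeling. It performs well in terms of Precision, Recall, and F1.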
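As an illustration only, the following Python sketch shows one way the idea in the abstract could be realized: walk the DOM, group text into block-level elements, and keep the blocks whose statistics look like content. The abstract does not say which statistics the authors use, so text length and link density are assumptions here, as are the tag sets, the thresholds, and every name in the code (`BlockCollector`, `extract_content`, `precision_recall_f1`). The word-level scoring function is likewise only one plausible reading of Precision, Recall, and F1.

```python
from html.parser import HTMLParser

# Assumed tag sets; the paper's actual block segmentation rules are not given.
BLOCK_TAGS = {"div", "p", "article", "section", "td", "li"}
SKIP_TAGS = {"script", "style", "noscript"}


class BlockCollector(HTMLParser):
    """Walk the DOM and collect text statistics per block-level element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open block elements, innermost last
        self.blocks = []     # closed blocks, in closing order
        self.skip_depth = 0  # > 0 while inside script/style/noscript
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append(
                {"tag": tag, "text": [], "chars": 0, "link_chars": 0}
            )

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and any(b["tag"] == tag for b in self.stack):
            # Close the matching block and any unclosed blocks nested in it.
            while self.stack:
                block = self.stack.pop()
                self.blocks.append(block)
                if block["tag"] == tag:
                    break

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth or not self.stack:
            return
        block = self.stack[-1]  # text is credited to the innermost block only
        block["text"].append(text)
        block["chars"] += len(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def extract_content(html, min_chars=40, max_link_density=0.3):
    """Return the text of blocks that pass the (assumed) statistical filters."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        if block["chars"] < max(min_chars, 1):
            continue  # too little text to be main content
        if block["link_chars"] / block["chars"] > max_link_density:
            continue  # mostly anchor text: likely navigation or advertising
        kept.append(" ".join(block["text"]))
    return kept


def precision_recall_f1(extracted, reference):
    """Word-level scores against a hand-made reference text (assumed metric)."""
    ext, ref = set(extracted.split()), set(reference.split())
    overlap = len(ext & ref)
    precision = overlap / len(ext) if ext else 0.0
    recall = overlap / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Calling `extract_content(page_html)` returns a list of block texts. Because the filters are plain thresholds, no training data or labeling is needed, which mirrors the motivation stated in the abstract, though the thresholds would still have to be tuned for real pages.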