A Template Independent Approach for Web News and Blog Content Extraction

Xueyang Ma, Hongli Zhang, Xiangzhan Yu, Yingjun Li
Venue: 2016 3rd International Conference on Information Science and Control Engineering (ICISCE)
Published: 2016-07-08
DOI: 10.1109/ICISCE.2016.36 (https://doi.org/10.1109/ICISCE.2016.36)
Citations: 2

Abstract

The Web has become a large platform for publishing and consuming information. Web news and blogs are both representative information sources that provide convenient ways to stay informed. In addition to the main content, most web pages also contain navigation panels, advertisements, recommended articles, and so on. Effectively extracting news and blog content while filtering out this noise is necessary and challenging. In this paper we propose a news and blog content extraction approach that is portable across different languages and domains. Our extensive case studies show that characters that do not belong to anchor text but belong to text containing stop words are more likely to be genuine content. Our method first traverses the entire DOM tree and counts these valid characters attached to each DOM node. Then we recursively step into the most representative child node based on its valid characters, and finally stop at the main content node according to a predefined criterion. To validate the approach, we conduct experiments on online news and blog files randomly selected from well-known Chinese and English websites. Experimental results show that our method achieves a 96% F1-measure on average and outperforms CETR.
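
The procedure described in the abstract (count valid characters per DOM node, then descend into the most representative child until a stopping criterion is met) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation. The abstract does not define the stop-word list, how "most representative" is scored, or the stopping criterion, so the following are assumptions made here: a tiny English stop-word set, whitespace tokenization (which would not work for Chinese text as-is), "most representative" meaning the child with the largest valid-character count, and stopping when that child holds less than a `ratio` share of its parent's valid characters. All class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a tiny stop-word list for illustration; the paper's list is not given.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Text under <a> is anchor text; script/style text is never content.
EXCLUDED_TAGS = {"a", "script", "style"}


class Node:
    def __init__(self, tag, parent=None, attrs=None):
        self.tag, self.parent, self.attrs = tag, parent, attrs or {}
        self.children = []
        self.own_text = []  # text chunks attached directly to this node
        self.valid = 0      # valid characters in this node's subtree


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.cur, dict(attrs))
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open tag; ignore stray end tags.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.own_text.append(data.strip())


def count_valid(node):
    """Post-order pass: store the valid-character count of each subtree."""
    if node.tag in EXCLUDED_TAGS:
        return 0  # anchor/script/style subtrees contribute nothing
    own = sum(len(t) for t in node.own_text
              if STOP_WORDS & set(t.lower().split()))
    node.valid = own + sum(count_valid(c) for c in node.children)
    return node.valid


def find_main_content(root, ratio=0.5):
    """Descend into the child with the most valid characters.

    Assumed stopping criterion: stop when the best child holds less than
    `ratio` of the current node's valid characters.
    """
    node = root
    while node.children:
        best = max(node.children, key=lambda c: c.valid)
        if best.valid == 0 or best.valid < ratio * node.valid:
            break
        node = best
    return node


def extract_main_node(html, ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    count_valid(builder.root)
    return find_main_content(builder.root, ratio)


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home of the news</a> <a href="/b">Blogs</a></div>
<div id="article">
  <p>The council voted to approve the budget in a late session.</p>
  <p>It is the first increase of this size that the city has seen.</p>
  <p>Residents can read the full plan on the website of the council.</p>
</div>
<div id="footer">Copyright 2016</div>
</body></html>
"""

if __name__ == "__main__":
    main = extract_main_node(SAMPLE)
    print(main.tag, main.attrs.get("id"), main.valid)
```

On the sample page, the navigation links contain stop words but are anchor text, and the footer has no stop words, so neither contributes valid characters. The descent passes through `html` and `body` and stops at the `article` div, because no single paragraph holds half of its valid characters. The `ratio` threshold is a tunable guess, not a value taken from the paper.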