Improving Webpage Content Extraction by extending a novel single page extraction approach: A case study with Thai websites

W. Thanadechteemapat, L. Fung
Published in: 2012 International Conference on Machine Learning and Cybernetics, 2012-07-15
DOI: 10.1109/ICMLC.2012.6359546 (https://doi.org/10.1109/ICMLC.2012.6359546)
Citation count: 1

Abstract

This paper proposes a Web Content Extraction technique that works with both single and multiple pages, based on heuristic rules. For multiple page extraction, an Extracted Content Matching (ECM) technique is proposed to identify noise among the extracted results. Several features are also introduced to reduce processing time, such as the use of XPath, file compression, and parallel processing. Performance is assessed in terms of precision, recall and F-measure, computed from the length of the extracted content. Initial results, obtained by comparing the output of the proposed approach with manually extracted content, are good.
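
The abstract does not spell out how the length-based scores are computed, so the following Python sketch is only one plausible reading, not the authors' method. It assumes that the overlap is the number of characters the extracted text shares with the manually extracted reference, measured here with difflib from the standard library. The function name length_based_scores and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def length_based_scores(extracted: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) computed from character lengths.

    Assumption: "overlap" is the total length of the matching blocks that
    difflib finds between the extracted text and the manual reference.
    """
    matcher = SequenceMatcher(None, extracted, reference, autojunk=False)
    overlap = sum(block.size for block in matcher.get_matching_blocks())

    # Precision: share of the extracted length that is genuine content.
    precision = overlap / len(extracted) if extracted else 0.0
    # Recall: share of the reference length that was recovered.
    recall = overlap / len(reference) if reference else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical strings: the extractor kept a menu label (noise)
    # in front of the real article text.
    manual = "Bangkok is the capital of Thailand."
    automatic = "Home | News | Bangkok is the capital of Thailand."
    p, r, f = length_based_scores(automatic, manual)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Under this reading, leftover noise such as navigation text lowers precision, while missing content lowers recall, which is consistent with the role the abstract gives to ECM in identifying noise.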