Web Content Extraction by Weighing the Fundamental Contextual Rules

Mahdi Mohammadi, M. Shayegan, Nima Latifi
2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS), published 2021-12-29
DOI: 10.1109/ICSPIS54653.2021.9729342 (https://doi.org/10.1109/ICSPIS54653.2021.9729342)

Abstract

Data access, sharing, extraction, and usage have become vital issues for technology experts. With the rapid growth of content on the Web, new and up-to-date approaches to extracting data from it are needed. However, web pages contain much useless and unrelated information, such as navigation panels, tables of contents, advertisements, service catalogues, and menus. Web content can therefore be divided into useful (original) and useless (secondary) content, and most end users are looking for the useful content. This research presents a new approach to extracting useful content from the Web. Child nodes are selected as the original content by weighting the nodes of the DOM tree according to fundamental contextual rules. After the web page is standardized and the DOM tree is built, the best child node of each parent node is selected according to a weighting algorithm; then the best path and the best sample node are selected. Applied to several datasets, the presented solution achieves high accuracy, with precision, recall, and F-measure of 0.992, 0.983, and 0.988, respectively.
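The pipeline described in the abstract (build a DOM tree, weight the child nodes, follow the best child at each level) can be illustrated with a minimal sketch. The abstract does not state the contextual rules or their weights, so everything specific here is an assumption made for illustration: the weight is taken to be non-link text length minus link text length, the descent stops when the best child holds less than half of its parent's weight, and the names `Node`, `TreeBuilder`, `weight`, `best_path`, and `extract` are hypothetical. The sketch also assumes the page has already been standardized into well-nested markup.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is never visible content.
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are not pushed onto the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this node
        self.link_len = 0  # the part of text_len that lies inside an <a>


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from (already standardized) HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def _inside(self, tag):
        node = self.current
        while node is not None:
            if node.tag == tag:
                return True
            node = node.parent
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in SKIP_TAGS:
            return
        n = len(data.strip())
        self.current.text_len += n
        if self._inside("a"):
            self.current.link_len += n


def totals(node):
    """Return (text, link_text) character counts for the whole subtree."""
    text, link = node.text_len, node.link_len
    for child in node.children:
        t, l = totals(child)
        text += t
        link += l
    return text, link


def weight(node):
    # Illustrative rule only: non-link text counts for, link text against.
    text, link = totals(node)
    return (text - link) - link


def best_path(root, keep_ratio=0.5):
    """Follow the heaviest child until the content spreads over siblings."""
    path, node = [], root
    while node.children:
        best = max(node.children, key=weight)
        w = weight(best)
        if w <= 0 or w < keep_ratio * weight(node):
            break
        path.append(best.tag)
        node = best
    return path, node


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return best_path(builder.root)
```

Calling `extract(page)` returns the list of tags along the chosen path and the node where the descent stopped. Under these assumed rules, a link-only menu gets a negative weight and is never entered. A container whose text is spread over several paragraphs is returned whole, because no single paragraph holds half of its weight. The weighting rules of the paper itself would replace `weight`.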