Web Content Extraction by Weighing the Fundamental Contextual Rules
Mahdi Mohammadi, M. Shayegan, Nima Latifi
2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS), published 2021-12-29
DOI: 10.1109/ICSPIS54653.2021.9729342 (https://doi.org/10.1109/ICSPIS54653.2021.9729342)
Citations: 0
Abstract
Data access, sharing, extraction, and use have become vital concerns for technology experts. With the rapid growth of content on the Web, new and up-to-date approaches are needed for extracting data from it. However, web pages contain much useless and unrelated information, such as navigation panels, tables of contents, advertisements, service catalogues, and menus. Web content can therefore be divided into useful (original) and useless (secondary) content, and most recipients and end users are looking for the useful content. This research presents a new approach for extracting useful content from the Web. Child nodes are selected as the original content by applying a method that weighs fundamental contextual rules to the nodes of the DOM tree. After the web page is standardized and its DOM tree is built, the best child node of each parent node is selected according to a weighting algorithm; then the best path and the best sample node are selected. Applied to several datasets, the presented solution achieves high accuracy, with Precision, Recall, and F-measure of 0.992, 0.983, and 0.988, respectively.
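The abstract does not list the contextual rules or their weights, so the following is only a minimal sketch of the general idea: build a DOM-like tree, weight every node, and descend from the root through the best-weighted child. Everything specific here is an assumption made for illustration, not the authors' method: the rule set (boilerplate tags score zero, link text is penalized), the `weight` function, the `keep_ratio` stopping threshold, and the helper names `build_tree` and `best_path`. The input is assumed to be well-formed, which stands in for the standardization step.

```python
from html.parser import HTMLParser

# Hypothetical rule inputs; the abstract does not specify the real rules.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "menu", "form"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_len = 0       # characters of text anywhere below this node
        self.link_text_len = 0  # the part of that text that sits inside <a>


class TreeBuilder(HTMLParser):
    """Builds a simple tree; assumes already standardized (well-formed) HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or self.current.tag in SKIP_TAGS:
            return
        # Collect the ancestor chain and note whether we are inside a link.
        chain, node = [], self.current
        while node is not None:
            chain.append(node)
            node = node.parent
        in_link = any(a.tag == "a" for a in chain)
        for a in chain:
            a.text_len += n
            if in_link:
                a.link_text_len += n


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def weight(node):
    """Assumed scoring rule: non-link text length scaled by non-link share."""
    if node.tag in BOILERPLATE_TAGS or node.text_len == 0:
        return 0.0
    non_link = node.text_len - node.link_text_len
    return non_link * (non_link / node.text_len)


def best_path(root, keep_ratio=0.6):
    """Follow the best-weighted child until no child dominates its parent."""
    path, node = [root], root
    while node.children:
        best = max(node.children, key=weight)
        if weight(best) == 0 or weight(best) < keep_ratio * weight(node):
            break
        path.append(best)
        node = best
    return path


SAMPLE = """<html><body>
<nav><a href="/">Home</a><a href="/about">About</a></nav>
<div id="main"><p>The first paragraph carries the main article text.</p>
<p>The second paragraph continues the same article.</p>
<p>The third paragraph closes the article body.</p></div>
<div id="side"><a href="/x">Related link one</a><a href="/y">Related link two</a></div>
</body></html>"""

path = best_path(build_tree(SAMPLE))
print(" > ".join(n.tag for n in path), path[-1].attrs)
```

On this sample the descent stops at the `div` with `id="main"`: the navigation and sidebar blocks score zero, and no single paragraph holds enough of the parent's weight to justify going deeper. A real system would replace the two toy rules with the full rule set and tuned weights.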