Classification-Based Adaptive Web Scraper
S. UjwalB.V., Bharat Gaind, Abhishek Kundu, Anusha K. Holla, Mukund Rungta
2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 125-132, published 2017-12-01
DOI: 10.1109/ICMLA.2017.0-168 (https://doi.org/10.1109/ICMLA.2017.0-168)
Citation count: 9
Abstract
Web scraping is an important problem in computer science. Commonly used position-based or structure-based scraping tools must be manually reconfigured whenever the structure of the target web page changes. In this paper, we address this information-extraction problem for web pages that consist of repetitive blocks. We extract these blocks and their constituent attributes using a novel classification-based approach. Our approach achieves high accuracy when extracting product offers from an offer-aggregator website, and it is highly adaptive to changes in the website's structure.
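
The contrast between position-based and classification-based extraction can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not describe their features or classifier, so the hand-written rules in `classify` are a hypothetical stand-in for a trained model, and the names `TextCollector`, `classify`, `extract_offer`, and the two sample layouts are all assumptions made for illustration. The sketch labels each text fragment of a block by its content, so the same attributes are recovered after the markup is rearranged.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty text fragments in document order, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append(text)


def classify(fragment):
    """Hypothetical rules standing in for a trained classifier (assumption)."""
    if re.search(r"[$€₹]\s?\d", fragment):
        return "price"
    if re.search(r"\d+\s?% off", fragment, re.IGNORECASE):
        return "discount"
    return "title"


def extract_offer(block_html):
    """Label each text fragment of one block by its content, not its position."""
    collector = TextCollector()
    collector.feed(block_html)
    collector.close()
    offer = {}
    for fragment in collector.fragments:
        # Keep the first fragment seen for each predicted label.
        offer.setdefault(classify(fragment), fragment)
    return offer


# Two invented layouts for the same offer: different tags and a different order.
old_layout = "<div><h2>USB cable</h2><span>$4.99</span><b>20% off</b></div>"
new_layout = "<li><em>20% off</em><p><a>USB cable</a></p><i>$4.99</i></li>"

print(extract_offer(old_layout))
print(extract_offer(new_layout))
```

A selector tied to tag names or child positions would need to be rewritten for `new_layout`, whereas both calls above yield the same title, price, and discount. A real system would replace the rules with a classifier trained on labeled fragments.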