{"title":"Scraping Relevant Images from Web Pages Without Download","authors":"Erdinç Uzun","doi":"10.1145/3616849","DOIUrl":null,"url":null,"abstract":"Automatically scraping relevant images from web pages is an error-prone and time-consuming task, leading experts to prefer manually preparing extraction patterns for a website. Existing web scraping tools are built on these patterns. However, this manual approach is laborious and requires specialized knowledge. Automatic extraction approaches, while a potential solution, require large training datasets and numerous features, including width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, utilizes small training datasets, and has a low error rate while saving time and storage. Our approach involves clustering web pages from a website and suggesting several pages for a non-expert to annotate relevant images. The approach then uses these annotations to construct a learning model based on textual data from the HTML elements. In the experiments, we used a dataset of 635,015 images from 200 news websites, each containing 100 pages, with 22,632 relevant images. When comparing several machine learning methods for both automatic approaches and our proposed approach, the AdaBoost method yields the best performance results. When using automatic extraction approaches, the best f-Measure that can be achieved is 0.805 with a learning model constructed from a large training dataset consisting of 120 websites (12,000 web pages). In contrast, our approach achieved an average f-Measure of 0.958 for 200 websites with only six web pages annotated per website. This means that a non-expert only needs to examine 1,200 web pages to determine the relevant images for 200 websites. Our approach also saves time and storage space by not requiring the download of images and can be easily integrated into currently available web scraping tools because it is based on textual data.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":" ","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2023-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on the Web","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3616849","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 1
Abstract
Automatically scraping relevant images from web pages is an error-prone and time-consuming task, leading experts to prefer manually preparing extraction patterns for a website; existing web scraping tools are built on these patterns. However, this manual approach is laborious and requires specialized knowledge. Automatic extraction approaches are a potential alternative, but they require large training datasets and numerous features, such as width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, uses small training datasets, and achieves a low error rate while saving time and storage. Our approach clusters the web pages of a website and suggests several pages for a non-expert to annotate with the relevant images. It then uses these annotations to construct a learning model based on textual data from the HTML elements. In the experiments, we used a dataset of 635,015 images from 200 news websites, each contributing 100 pages, with 22,632 relevant images. Among the several machine learning methods compared for both the automatic approaches and our proposed approach, AdaBoost yields the best performance. With automatic extraction approaches, the best f-Measure that can be achieved is 0.805, using a learning model constructed from a large training dataset of 120 websites (12,000 web pages). In contrast, our approach achieved an average f-Measure of 0.958 over 200 websites with only six web pages annotated per website, meaning that a non-expert needs to examine only 1,200 web pages to determine the relevant images for 200 websites. Our approach also saves time and storage space by not requiring images to be downloaded, and, because it is based on textual data, it can be easily integrated into currently available web scraping tools.
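The abstract describes the pipeline only at a high level. The sketch below illustrates the two-step idea in Python under stated assumptions: cluster a site's pages and suggest one representative page per cluster for a non-expert to annotate, then train an AdaBoost classifier on textual data from each <img> element so that no image ever needs to be downloaded. The library choices (scikit-learn, BeautifulSoup) and the concrete feature design (src, alt, class attributes, parent tag name) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the semi-automatic approach described in the abstract.
# Assumptions: scikit-learn + BeautifulSoup; feature design is illustrative.
import numpy as np
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def suggest_pages(pages_html, n_annotate=6):
    """Cluster a site's pages by their raw HTML and return one
    representative page index per cluster for manual annotation."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=2000)
    X = vec.fit_transform(pages_html)
    km = KMeans(n_clusters=min(n_annotate, len(pages_html)), n_init=10).fit(X)
    dists = km.transform(X)  # distance of every page to every cluster centre
    return [int(np.argmin(dists[:, c])) for c in range(km.n_clusters)]

def img_text_features(html):
    """Represent each <img> by textual data from its HTML element only:
    src, alt, class attributes, and the parent tag name."""
    soup = BeautifulSoup(html, "html.parser")
    return [" ".join([img.get("src", ""), img.get("alt", ""),
                      " ".join(img.get("class", [])),
                      img.parent.name if img.parent else ""])
            for img in soup.find_all("img")]

# Toy annotations from a suggested page (1 = relevant, 0 = irrelevant, e.g. an ad).
X_train = img_text_features(
    '<img src="/news/p1.jpg" alt="protest" class="article-img">'
    '<img src="/ads/banner.gif" class="ad">')
y_train = [1, 0]

model = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      AdaBoostClassifier())
model.fit(X_train, y_train)
print(model.predict(img_text_features(
    '<img src="/news/p2.jpg" alt="rally" class="article-img">')))  # -> [1]
```

Because the classifier sees only strings already present in the HTML, the same model can be dropped into any scraper that parses the page source, which is what the abstract means by easy integration with existing tools.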
Journal description:
ACM Transactions on the Web (TWEB) is a journal publishing refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies. Topics in the scope of TWEB include but are not limited to the following: Browsers and Web Interfaces; Electronic Commerce; Electronic Publishing; Hypertext and Hypermedia; Semantic Web; Web Engineering; Web Services; Service-Oriented Computing; and XML.
In addition, papers addressing the intersection of the following broader technologies with the Web are also in scope: Accessibility; Business; Services; Education; Knowledge Management and Representation; Mobility and Pervasive Computing; Performance and Scalability; Recommender Systems; Searching, Indexing, Classification, Retrieval and Querying; Data Mining and Analysis; Security and Privacy; and User Interfaces.
Papers discussing specific Web technologies, applications, and the generation, management, and use of Web content are within scope. Papers describing novel applications of the Web, as well as papers on the underlying technologies, are also welcome.