Web data extraction using textual anchors

2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI). Pub Date: 2015-11-01. DOI: 10.1109/KBEI.2015.7436204

Ahmad Pouramini, Sh. Nasiri


Citations: 3

Abstract

In this paper, we present an approach and a visual tool, called ABDES, for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content, simulating the way a human user scans a web page for specific data. To create a wrapper, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call anchors, to build patterns for the target data regions and data records. We offer a polynomial-time data extraction algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure as the output of the algorithm. The wrappers generated by the system are robust and independent of the HTML structure, so they can be adapted to multiple websites to gather and integrate information.
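The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the ABDES implementation. It assumes well-formed XHTML input, a single hypothetical start anchor (`"Title:"`), and hypothetical field patterns written as regular expressions over the visible text. The names `visible_text`, `find_records`, `extract`, `PAGE_A`, `PAGE_B`, and `PATTERNS` are all invented for illustration. The bottom-up step climbs from an element containing the start anchor to the nearest ancestor whose visible text satisfies every field pattern; the top-down step then reads the fields from that record and writes them to an XML tree.

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical pages with the same visible text but different markup.
PAGE_A = ("<html><body><table>"
          "<tr><td><b>Title:</b> Dune</td><td>Price: $9.99</td></tr>"
          "<tr><td><b>Title:</b> Emma</td><td>Price: $4.50</td></tr>"
          "</table></body></html>")
PAGE_B = ("<html><body><ul>"
          "<li>Title: Dune<br/>Price: $9.99</li>"
          "<li>Title: Emma<br/>Price: $4.50</li>"
          "</ul></body></html>")

# Assumed anchor patterns: textual delimiters around each target field.
PATTERNS = {
    "title": re.compile(r"Title:\s*(.+?)\s*Price:"),
    "price": re.compile(r"Price:\s*\$(\d+\.\d{2})"),
}

def visible_text(el):
    """Whitespace-normalised text of an element and all its descendants."""
    return " ".join(" ".join(el.itertext()).split())

def find_records(root, start_anchor, patterns):
    """Bottom-up: climb from each anchor hit to the smallest matching ancestor."""
    parent = {child: p for p in root.iter() for child in p}
    records, seen = [], set()
    for el in root.iter():
        # Simplification: only the element's own leading text is checked.
        if start_anchor not in (el.text or ""):
            continue
        node = el
        while node is not None and not all(
            p.search(visible_text(node)) for p in patterns.values()
        ):
            node = parent.get(node)
        if node is not None and id(node) not in seen:
            seen.add(id(node))
            records.append(node)
    return records

def extract(page, start_anchor, patterns):
    """Top-down: read each field from a record and map it onto an XML tree."""
    root = ET.fromstring(page)
    out = ET.Element("records")
    for rec in find_records(root, start_anchor, patterns):
        text = visible_text(rec)
        item = ET.SubElement(out, "record")
        for field, pat in patterns.items():
            ET.SubElement(item, field).text = pat.search(text).group(1)
    return out

for page in (PAGE_A, PAGE_B):
    print(ET.tostring(extract(page, "Title:", PATTERNS), encoding="unicode"))
```

Because the patterns refer only to visible text, the same wrapper handles both the table layout and the list layout, which is the kind of structure independence the abstract claims. A real wrapper would also need to handle anchors in tail text, missing fields, and malformed HTML, none of which this sketch attempts.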
