Automatic labeling of hidden web data using Multi-Heuristics Annotator

Future Computing and Informatics Journal. Pub Date: 2018-12-01. Epub Date: 2018-11-16. DOI: 10.1016/j.fcij.2018.11.004

Umamageswari Baskaran, Kalpana Ramanujam

Volume 3, Issue 2, Pages 417-423. Article page: https://www.sciencedirect.com/science/article/pii/S2314728818300394

Citations: 0

Abstract

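The hidden web contains a huge amount of high-quality data that is not indexed by search engines. It consists of web pages generated dynamically by embedding backend data that matches the search keywords into server-side templates. These pages are created for human consumption, which makes automated processing cumbersome because the structured data is embedded within unstructured HTML tags. To enable machine processing, the structured data must be detected, extracted, and annotated. Many heuristic-based approaches, such as DeLa [1] and MSAA [2], are available in the literature for automatic annotation. Most of these techniques fail if the data values do not contain labels as part of the attribute value itself, or if the labels are not available explicitly in the form interface or the query response pages. The proposed technique addresses this issue by collecting domain keywords from multiple websites belonging to the business domain of interest and then capturing the patterns in the form of semantic rules. Experimental results show that a single heuristic is not sufficient to label all the data value groups. The annotators are applied one after the other, ordered by their capability of assigning the most appropriate label. Experiments show that this technique improves precision and recall compared to existing annotation techniques.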
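The idea of applying several annotators one after the other can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three annotator functions, their ordering, the `FORM_FIELD_OPTIONS` table, and the `SEMANTIC_RULES` patterns are all hypothetical stand-ins chosen for illustration. Each annotator inspects one data value group (the values found at the same position across extracted records) and either returns a label or declines; the first annotator that returns a label wins.

```python
import re
from typing import Callable, List, Optional, Sequence

Group = Sequence[str]                      # values from one column of extracted records
Annotator = Callable[[Group], Optional[str]]

# Hypothetical domain knowledge (assumed for this sketch, not from the paper).
FORM_FIELD_OPTIONS = {"format": {"hardcover", "paperback", "ebook"}}
SEMANTIC_RULES = {
    "price": re.compile(r"^\$\d+(\.\d{2})?$"),
    "year": re.compile(r"^(19|20)\d{2}$"),
}


def in_value_label_annotator(group: Group) -> Optional[str]:
    """Use a 'Label:' prefix when every value in the group shares the same one."""
    prefixes = set()
    for value in group:
        match = re.match(r"\s*([A-Za-z][A-Za-z ]*?)\s*:", value)
        if not match:
            return None
        prefixes.add(match.group(1).strip().lower())
    return prefixes.pop() if len(prefixes) == 1 else None


def form_interface_annotator(group: Group) -> Optional[str]:
    """Use a search-form field name when all values are options of that field."""
    values = {v.strip().lower() for v in group}
    for field, options in FORM_FIELD_OPTIONS.items():
        if values and values <= options:
            return field
    return None


def semantic_rule_annotator(group: Group) -> Optional[str]:
    """Use a rule's label when every value matches that rule's pattern."""
    for label, pattern in SEMANTIC_RULES.items():
        if group and all(pattern.match(v.strip()) for v in group):
            return label
    return None


# Assumed priority order: the earliest annotator that yields a label is trusted.
ANNOTATORS: List[Annotator] = [
    in_value_label_annotator,
    form_interface_annotator,
    semantic_rule_annotator,
]


def annotate(groups: Sequence[Group],
             annotators: Sequence[Annotator] = ANNOTATORS) -> List[Optional[str]]:
    """Label each group with the first non-None result; None if all decline."""
    labels: List[Optional[str]] = []
    for group in groups:
        label = None
        for annotator in annotators:
            label = annotator(group)
            if label is not None:
                break
        labels.append(label)
    return labels


if __name__ == "__main__":
    sample = [
        ["Author: Jane Doe", "Author: John Roe"],
        ["Paperback", "Hardcover"],
        ["$12.50", "$8.00"],
        ["n/a", "unknown"],
    ]
    print(annotate(sample))
```

In this sketch no single annotator labels all four sample groups, which mirrors the abstract's observation that one heuristic alone is insufficient; a group that no annotator recognizes is left unlabeled rather than guessed.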


