Automatic labeling of hidden web data using Multi-Heuristics Annotator

Umamageswari Baskaran, Kalpana Ramanujam
Future Computing and Informatics Journal, Volume 3, Issue 2, Pages 417-423. Published 2018-12-01. DOI: 10.1016/j.fcij.2018.11.004
https://www.sciencedirect.com/science/article/pii/S2314728818300394

Abstract


The hidden web contains a huge amount of high-quality data that is not indexed by search engines. It consists of web pages generated dynamically by embedding backend data that matches the search keywords into server-side templates. These pages are created for human consumption, which makes automated processing cumbersome because the structured data is embedded within unstructured HTML tags. To enable machine processing, structured data must be detected, extracted, and annotated. Many heuristic-based approaches, such as DeLa [1] and MSAA [2], are available in the literature to perform automatic annotation. Most of these techniques fail if the data values do not contain a label as part of the attribute value itself, or if the label is not explicitly available in the form interface or the query response pages. The proposed technique addresses this issue by collecting domain keywords from multiple websites belonging to the business domain of interest and then capturing the patterns in the form of semantic rules. Experimental results show that a single heuristic is not sufficient to label all the data value groups. The annotators are therefore applied one after the other, ordered by their capability of assigning the most appropriate label. Experiments show that this technique improves precision and recall compared to existing annotation techniques.
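
The idea of applying several annotators one after the other can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three heuristics, their order, the keyword dictionary `DOMAIN_KEYWORDS`, the regex rules in `SEMANTIC_RULES`, and all function names are hypothetical and chosen only to show how a data value group falls through to the next annotator when the current one cannot assign a label.

```python
import re
from typing import Callable, List, Optional, Sequence

# A "data value group" is modeled here as the list of values extracted from
# the same position across result records. All names below are hypothetical.
Annotator = Callable[[Sequence[str]], Optional[str]]

# Assumed domain keywords, as if collected from several sites of one domain.
DOMAIN_KEYWORDS = {
    "format": {"paperback", "hardcover", "ebook"},
    "publisher": {"press", "publishing", "publishers"},
}

# Assumed semantic rules: a label plus a pattern every value must match.
SEMANTIC_RULES = [
    ("price", re.compile(r"^[$€£]\s?\d+(\.\d{2})?$")),
    ("year", re.compile(r"^(19|20)\d{2}$")),
]


def in_value_label_annotator(values: Sequence[str]) -> Optional[str]:
    """Use a shared 'Label: value' prefix when every value carries the same one."""
    prefixes = set()
    for value in values:
        match = re.match(r"\s*([A-Za-z][A-Za-z ]*?)\s*:\s*\S", value)
        if match is None:
            return None
        prefixes.add(match.group(1).strip().lower())
    return prefixes.pop() if len(prefixes) == 1 else None


def domain_keyword_annotator(values: Sequence[str]) -> Optional[str]:
    """Assign the label whose keywords occur in more than half of the values."""
    counts = {}
    for value in values:
        tokens = set(re.findall(r"[a-z]+", value.lower()))
        for label, keywords in DOMAIN_KEYWORDS.items():
            if tokens & keywords:
                counts[label] = counts.get(label, 0) + 1
    if not counts:
        return None
    label, hits = max(counts.items(), key=lambda item: item[1])
    return label if hits * 2 > len(values) else None


def semantic_rule_annotator(values: Sequence[str]) -> Optional[str]:
    """Assign the first rule label whose pattern matches every value."""
    for label, pattern in SEMANTIC_RULES:
        if values and all(pattern.match(value.strip()) for value in values):
            return label
    return None


# Assumed ordering: the annotator judged most reliable is tried first.
ANNOTATORS: List[Annotator] = [
    in_value_label_annotator,
    domain_keyword_annotator,
    semantic_rule_annotator,
]


def annotate(values: Sequence[str],
             annotators: Sequence[Annotator] = ANNOTATORS) -> Optional[str]:
    """Try each annotator in order and return the first label found, else None."""
    for annotator in annotators:
        label = annotator(values)
        if label is not None:
            return label
    return None
```

For example, `annotate(["Author: J. Smith", "Author: A. Rao"])` is handled by the first annotator, while a group such as `["$12.99", "$8.50"]` carries no in-value label or domain keyword and is only labeled by the rule-based annotator. A group that no annotator recognizes stays unlabeled, which is the case the abstract attributes to relying on a single heuristic.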
