Automatic labeling of hidden web data using Multi-Heuristics Annotator
Authors: Umamageswari Baskaran, Kalpana Ramanujam
Journal: Future Computing and Informatics Journal, vol. 3(2), pp. 417-423
Publication date: 2018-12-01
Publication type: Journal Article
DOI: 10.1016/j.fcij.2018.11.004
URL: https://www.sciencedirect.com/science/article/pii/S2314728818300394
Citation count: 0
Abstract
The hidden web contains a huge amount of high-quality data that is not indexed by search engines. It consists of web pages that are generated dynamically by embedding backend data matching the search keywords into server-side templates. These pages are created for human consumption, which makes automated processing cumbersome because the structured data is embedded within unstructured HTML tags. To enable machine processing, the structured data must be detected, extracted and annotated. Many heuristic-based approaches, such as DeLa [1] and MSAA [2], are available in the literature to perform automatic annotation. Most of these techniques fail when the data values do not carry labels as part of the attribute value itself, or when the labels are not explicitly available in the form interface or the query response pages. The proposed technique addresses this issue by collecting domain keywords from multiple websites belonging to the business domain of interest and then capturing the patterns in the form of semantic rules. Experimental results show that a single heuristic is not sufficient to label all the data value groups. The annotators are applied one after another, in order of their capability of assigning the most appropriate label. Experiments show that this technique improves precision and recall compared to existing annotation techniques.
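The idea of applying several annotators one after another can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two heuristics, the rule patterns, and all names (`in_value_label_annotator`, `semantic_rule_annotator`, `SEMANTIC_RULES`, `PIPELINE`, `annotate`) are hypothetical and chosen only to show how a later annotator can label a data value group that an earlier one cannot.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

# An annotator receives one group of data values and returns a label,
# or None when its heuristic does not apply.
Annotator = Callable[[Sequence[str]], Optional[str]]


def in_value_label_annotator(values: Sequence[str]) -> Optional[str]:
    """Hypothetical heuristic: every value carries the same 'Label: value' prefix."""
    prefixes = set()
    for value in values:
        match = re.match(r"\s*([A-Za-z ]+?)\s*:\s*\S", value)
        if not match:
            return None
        prefixes.add(match.group(1).strip().lower())
    return prefixes.pop() if len(prefixes) == 1 else None


# Assumed semantic rules; in the paper these would be derived from domain
# keywords collected across websites. The patterns here are illustrative only.
SEMANTIC_RULES: List[Tuple[str, "re.Pattern[str]"]] = [
    ("price", re.compile(r"^\s*[$€£]\s?\d+(\.\d{2})?\s*$")),
    ("isbn", re.compile(r"^\s*(97[89][- ]?)?\d{9}[\dXx]\s*$")),
    ("year", re.compile(r"^\s*(19|20)\d{2}\s*$")),
]


def semantic_rule_annotator(values: Sequence[str]) -> Optional[str]:
    """Hypothetical heuristic: all values in the group match one rule pattern."""
    for label, pattern in SEMANTIC_RULES:
        if values and all(pattern.match(v) for v in values):
            return label
    return None


# Annotators in priority order (this ordering is an assumption).
PIPELINE: List[Tuple[str, Annotator]] = [
    ("in-value", in_value_label_annotator),
    ("semantic-rule", semantic_rule_annotator),
]


def annotate(values: Sequence[str],
             annotators: Sequence[Tuple[str, Annotator]] = PIPELINE
             ) -> Tuple[Optional[str], Optional[str]]:
    """Try each annotator in turn; return (label, annotator name) of the first hit."""
    for name, annotator in annotators:
        label = annotator(values)
        if label is not None:
            return label, name
    return None, None


if __name__ == "__main__":
    print(annotate(["Price: $12.99", "Price: $8.50"]))
    print(annotate(["$12.99", "$8.50"]))
    print(annotate(["1999", "2005"]))
```

In this sketch the first group is labeled by the in-value heuristic, while the second and third groups carry no embedded label and fall through to the semantic rules, which mirrors the abstract's observation that a single heuristic does not cover all data value groups.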