DeWild: a tool for searching the web using wild cards
Haobin Li, Davood Rafiei
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006-08-06
DOI: 10.1145/1148170.1148344 (https://doi.org/10.1145/1148170.1148344)
Citations: 1
Abstract
A large volume of facts is available on the Web, and manually extracting these facts is time-consuming and often impractical. Example extraction tasks include compiling a list of scientists, a list of a company’s acquisitions, etc. Unless such lists have already been compiled and made available on the Web, one has to query a search engine, examine the pages returned, and extract a handful of instances from each page. Consider the case of extracting researchers: many bona fide researchers are not referred to as researchers. Instead, they are often described as scientists, experts, professors, etc. If only the term “researchers” is used in the query, many qualified instances will not be extracted. We demonstrate DeWild, a domain-independent system for searching and data extraction on the Web. A search in DeWild is expressed using a simple query with some wild cards, and the result of a query is a ranked list of rows that match the wild cards. For instance, given the query “Oracle acquired %”, the output is expected to be a ranked list of companies that were purchased by Oracle, preferably with the real Oracle acquisitions ranked highest. One type of wild card in DeWild is an extractor, which indicates a probable position of the data to be extracted. Another type of wild card, used for query relaxation, indicates that terms semantically similar to the given one should also be considered. For instance, the wild card can specify that words similar to “researchers” (e.g. scientists) should be part of the search. Building a unified query interface for a large number of extraction tasks is challenging. A problem with phrase queries, especially long ones, is that they can retrieve very few or no matches. Query relaxation techniques (e.g. [2]) are not generally applicable to phrase queries. DeWild uses an online
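The two kinds of wild card described in the abstract can be illustrated with a minimal Python sketch. It is not DeWild's implementation: the function names `compile_query` and `run_query`, the `similar` dictionary standing in for a source of semantically similar terms, the rule that an extracted instance is one to three capitalized words, and the ranking by raw match count are all assumptions made for illustration. Only the `%` extractor symbol comes from the abstract's example query.

```python
import re
from collections import Counter

# Assumed shape of an extracted instance: one to three capitalized words.
SLOT = r"((?:[A-Z][\w&-]*)(?:\s+[A-Z][\w&-]*){0,2})"


def compile_query(query, similar=None):
    """Turn a wild-card query into a regex; '%' marks the extraction slot.

    `similar` is a hypothetical mapping from a query term to terms that
    should be accepted in its place (the query-relaxation idea).
    """
    similar = similar or {}
    parts = []
    for token in query.split():
        if token == "%":
            parts.append(SLOT)
        else:
            alternatives = [token] + similar.get(token.lower(), [])
            parts.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
    return re.compile(r"\s+".join(parts))


def run_query(query, snippets, similar=None):
    """Return (instance, count) pairs, most frequent first.

    This sketch supports a single '%' slot per query and ranks purely by
    how many times each instance was matched.
    """
    pattern = compile_query(query, similar)
    counts = Counter()
    for text in snippets:
        for match in pattern.finditer(text):
            counts[match.group(1)] += 1
    return counts.most_common()


# Made-up snippets standing in for text returned by a search engine.
snippets = [
    "Oracle acquired PeopleSoft last year.",
    "Analysts noted that Oracle acquired PeopleSoft.",
    "Oracle acquired Siebel in a later deal.",
]
print(run_query("Oracle acquired %", snippets))
# [('PeopleSoft', 2), ('Siebel', 1)]
```

With a relaxation mapping such as `{"researchers": ["scientists", "experts", "professors"]}`, the query `researchers such as %` would also match text that says "scientists such as ...", which is the behaviour the abstract motivates with its "researchers" example. Without the mapping, the same query finds nothing in that text.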