DeWild: a tool for searching the web using wild cards

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. Pub Date: 2006-08-06. DOI: 10.1145/1148170.1148344

Haobin Li, Davood Rafiei


Citations: 1

Abstract

A large volume of facts is available on the Web, and manually extracting these facts is time-consuming and often impractical. Example extraction tasks include compiling a list of scientists, a list of a company’s acquisitions, etc. Unless such lists have already been compiled and made available on the Web, one has to query a search engine, examine the pages returned, and extract a handful of instances from each page. Consider the case of extracting researchers; many bona fide researchers are not referred to by that term. Instead, they are often described as scientists, experts, professors, etc. If only the term “researchers” is used in the query, many qualified instances will not be extracted. We demonstrate DeWild, a domain-independent system for searching and data extraction on the Web. A search in DeWild is expressed using a simple query with some wild cards, and the result of a query is a ranked list of rows that match the wild cards. For instance, given the query “Oracle acquired %”, the output is expected to be a ranked list of companies that were purchased by Oracle, preferably with the real Oracle acquisitions ranked the highest. One type of wild card in DeWild is an extractor. An extractor is used to indicate a probable position of the data to be extracted. Another type of wild card, used for query relaxation, can indicate that terms semantically similar to the given one should also be considered. For instance, the wild card can specify that words similar to “researchers” (e.g. scientists) should be part of the search. Building a unified query interface for a large number of extraction tasks is challenging. A problem with phrase queries, especially long ones, is that they can retrieve very few or no matches. Query relaxation techniques (e.g. [2]) are not generally applicable to phrase queries. DeWild uses an online
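The idea of an extractor wild card can be illustrated with a minimal Python sketch. It is not DeWild's implementation: the function name `extract`, the regular-expression matching of capitalized word runs, the frequency-based ranking, and the sample snippets are all assumptions made for illustration, since the abstract does not describe how matching or ranking is actually done.

```python
import re
from collections import Counter


def extract(query, snippets):
    """Rank candidate fillers for the '%' extractor in `query`.

    Hypothetical helper: a candidate is a run of capitalized words found
    where '%' sits, and candidates are ranked by the number of snippets
    that contain them (an assumed stand-in for a real ranking function).
    """
    prefix, _, suffix = query.partition("%")
    prefix, suffix = prefix.strip(), suffix.strip()

    # Literal text before '%', then a captured run of capitalized words,
    # then the literal text after '%' if there is any.
    candidate = r"((?:[A-Z][\w&-]*)(?:\s+[A-Z][\w&-]*)*)"
    pattern = re.escape(prefix) + r"\s+" + candidate
    if suffix:
        pattern += r"\s+" + re.escape(suffix)
    regex = re.compile(pattern)

    counts = Counter()
    for text in snippets:
        # Count each candidate at most once per snippet.
        for match in set(regex.findall(text)):
            counts[match] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Invented sample snippets standing in for search-engine results.
    sample = [
        "Oracle acquired PeopleSoft after a long takeover bid.",
        "Last year Oracle acquired Siebel Systems.",
        "Analysts noted that Oracle acquired PeopleSoft to expand its business.",
    ]
    for name, support in extract("Oracle acquired %", sample):
        print(name, support)
```

Ranking by snippet support is only one plausible choice; it favors candidates that recur across independent pages, which matches the stated goal of placing real acquisitions highest.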
