Context Retrieval for Web Tables

Proceedings of the 2015 International Conference on The Theory of Information Retrieval. Pub Date: 2015-09-27. DOI: 10.1145/2808194.2809453

Hong Wang, Anqi Liu, Jing Wang, Brian D. Ziebart, Clement T. Yu, Warren Shen


Citations: 7

Abstract

Many modern knowledge bases are built by extracting information from millions of web pages. Though existing extraction methods primarily focus on web pages' main text, a huge amount of information is embedded within other web structures, such as web tables. Previous studies have shown that linking web page tables and textual context is beneficial for extracting more information from web pages. However, using the text surrounding each table without carefully assessing its relevance introduces noise in the extracted information, degrading its accuracy. To the best of our knowledge, we provide the first systematic study of the problem of table-related context retrieval: given a table and the sentences within the same web page, determine for each sentence whether it is relevant to the table. We define the concept of relevance and introduce a Table-Related Context Retrieval system (TRCR) in this paper. We experiment with different machine learning algorithms, including a recently developed algorithm that is robust to biases in the training data, and show that our system retrieves table-related context with F1=0.735.
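The task stated in the abstract (label each sentence on a page as relevant or not relevant to a given table, then score the labels with F1) can be illustrated with a minimal sketch. This is not the TRCR system: the abstract does not describe its features or learning algorithms, so the lexical-overlap scorer, the `Table` class, the function names, and the 0.3 threshold below are all hypothetical stand-ins chosen only to make the problem setup concrete.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Table:
    """Hypothetical table representation: a caption plus rows of cell strings."""
    caption: str
    cells: List[List[str]]


def tokens(text: str) -> set:
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(sentence: str, table: Table) -> float:
    """Fraction of the sentence's distinct tokens that also occur in the table."""
    sent = tokens(sentence)
    if not sent:
        return 0.0
    table_text = " ".join([table.caption] + [c for row in table.cells for c in row])
    return len(sent & tokens(table_text)) / len(sent)


def predict_relevant(sentences: List[str], table: Table,
                     threshold: float = 0.3) -> List[bool]:
    """Naive baseline (an assumption, not the paper's method):
    a sentence is relevant if its overlap score reaches the threshold."""
    return [overlap_score(s, table) >= threshold for s in sentences]


def f1_score(gold: List[bool], predicted: List[bool]) -> float:
    """F1 = harmonic mean of precision and recall over the 'relevant' class."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
table = Table(
    caption="Population of European capitals",
    cells=[["City", "Population"], ["Paris", "2100000"], ["Berlin", "3600000"]],
)
sentences = [
    "Paris and Berlin are the capitals with the largest population.",
    "Subscribe to our newsletter for weekly updates.",
    "The weather was pleasant last week.",
]
gold = [True, False, False]
predicted = predict_relevant(sentences, table)
```

A learned classifier, as the abstract describes, would replace `predict_relevant`; the per-sentence binary labels and the F1 evaluation would stay the same shape.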

