Ontology-based Document Spanning Systems for Information Extraction

Int. J. Semantic Comput. Pub Date : 2020-03-01 DOI:10.1142/s1793351x20400012

D. Lembo, Federico Maria Scafoglieri


Citations: 4

Abstract


Ontology-based Document Spanning Systems for Information Extraction

Information Extraction (IE) is the task of automatically organizing data extracted from free-text documents into a structured form. In many contexts, it is desirable that the extracted data are then organized according to an ontology, which provides a formal and conceptual representation of the domain of interest. Ontologies allow for a better interpretation of the data, as well as for their semantic integration with other information, as in Ontology-based Data Access (OBDA), a popular declarative framework for data management where an ontology is connected to a data layer through mappings. However, the data layer considered so far in OBDA has consisted essentially of relational databases, and how to declaratively couple an ontology with unstructured data sources is still unexplored. Leveraging the recent study on document spanners for rule-based IE by Fagin et al., in this paper we propose a new framework that allows text documents to be mapped to ontologies, in the spirit of OBDA. We investigate the problem of answering conjunctive queries in this framework. For ontologies specified in the Description Logics [Formula: see text] and [Formula: see text], we show that the problem is polynomial in the size of the underlying documents. We also provide algorithms that solve query answering by rewriting the input query on the basis of the ontology and its mapping toward the source documents. Through these techniques, we pursue a virtual approach, similar to that typically adopted in OBDA, which allows us to answer a query without first having to populate the entire ontology. Interestingly, for [Formula: see text], both the spanners used in the mapping and the one computed by the rewriting algorithm belong to the same class of expressiveness. This also holds for [Formula: see text], modulo some limitations on the form of the mapping. These results indicate that in these cases our framework can be easily implemented by decoupling ontology management and document access, where the latter can be delegated to an external IE system able to process the extraction rules we use in the mapping.
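
The virtual approach described in the abstract can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' algorithm: the document text, the `worksFor` and `Employee` predicates, the single domain axiom, and the `spanner`, `MAPPING`, `REWRITINGS`, and `answer` names are all hypothetical, and a Python regular expression with named groups stands in for a document spanner.

```python
import re
from typing import Iterator

# Hypothetical toy document; not taken from the paper.
DOC = "Alice works for Acme. Bob works for Initech."


def spanner(pattern: str, doc: str) -> Iterator[dict[str, tuple[int, int]]]:
    """Treat a regex with named groups as a document spanner.

    Each match yields one tuple that assigns a span [start, end)
    of the document to every variable (named group).
    """
    for m in re.finditer(pattern, doc):
        yield {var: m.span(var) for var in m.groupdict()}


# Illustrative mapping: ontology predicate -> extraction rule over the text.
MAPPING = {
    "worksFor": r"(?P<x>[A-Z]\w+) works for (?P<y>[A-Z]\w+)",
}

# Illustrative ontology axiom: the domain of worksFor is Employee.
# Rewriting the atomic query Employee(x) therefore also asks for worksFor(x, _).
REWRITINGS = {"Employee": [("worksFor", "x")]}


def answer(concept: str, doc: str) -> set[str]:
    """Answer the atomic query concept(x) without materializing any facts."""
    results = set()
    for predicate, var in REWRITINGS.get(concept, []):
        for assignment in spanner(MAPPING[predicate], doc):
            start, end = assignment[var]
            results.add(doc[start:end])
    return results


print(sorted(answer("Employee", DOC)))  # ['Alice', 'Bob']
```

The query is answered by rewriting it into an extraction rule and evaluating that rule directly on the document, so no `Employee` facts are ever stored. Evaluation could equally be handed to an external IE engine, which is the decoupling the abstract points to.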

