graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora

J. Lang. Technol. Comput. Linguistics. Pub Date: 2016-07-01. DOI: 10.21248/jlcl.31.2016.199

Thomas Krause, U. Leser, Anke Lüdeling


Citations: 10

Abstract

We present graphANNIS, a fast implementation of the established query language AQL for deeply annotated linguistic corpora. AQL builds on a graph-based abstraction for modeling and exchanging linguistic data, yet all of its current implementations use relational databases as their storage layer. In contrast, graphANNIS implements the ANNIS graph data model directly in main memory. We show that the vast majority of AQL functionality can be mapped to the basic operation of finding paths in a graph, and we present efficient implementations and index structures for this and all other required operations. We compare the performance of graphANNIS with that of the standard SQL-based implementation of AQL, using a workload of more than 3000 real-life queries on a set of 17 open corpora, each with a size of up to 3 million tokens, whose annotations range from simple, linear part-of-speech tagging to deeply nested discourse structures. For the entire workload, graphANNIS is more than 40 times faster, and it is slower in less than 3% of the queries. graphANNIS, as well as the workload and corpora used for evaluation, are freely available on GitHub and in the Zenodo Open Access archive.
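To make the central idea of the abstract concrete, here is a minimal sketch of how a dominance-style query could be answered by path finding over an in-memory annotation graph. It is an illustration only, not the graphANNIS implementation: the class `AnnotationGraph`, the helper `find_pairs`, the edge type name `"dominance"`, and the toy sentence are all assumptions made for this example, and plain breadth-first search stands in for whatever index structures the paper actually uses.

```python
from collections import defaultdict, deque


class AnnotationGraph:
    """Toy in-memory annotation graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.labels = {}  # node id -> {annotation name: value}
        # edge type -> source node -> list of target nodes
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_node(self, node, **annotations):
        self.labels[node] = annotations

    def add_edge(self, edge_type, source, target):
        self.edges[edge_type][source].append(target)

    def reachable(self, edge_type, source, min_dist=1, max_dist=None):
        """Breadth-first search: nodes reachable from `source` over edges of
        `edge_type`, with path length between min_dist and max_dist."""
        found = []
        seen = {source}
        queue = deque([(source, 0)])
        while queue:
            node, dist = queue.popleft()
            if dist >= min_dist:
                found.append(node)
            if max_dist is not None and dist >= max_dist:
                continue  # do not expand beyond the distance limit
            for nxt in self.edges[edge_type].get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return found


def find_pairs(graph, edge_type, left, right, **dist):
    """All (a, b) where a matches `left`, b matches `right`, and b is
    reachable from a. `left` and `right` are predicates over label dicts."""
    pairs = []
    for a, labels in graph.labels.items():
        if left(labels):
            for b in graph.reachable(edge_type, a, **dist):
                if right(graph.labels[b]):
                    pairs.append((a, b))
    return pairs


# Toy constituent tree for "the dog barks" (invented data).
g = AnnotationGraph()
g.add_node("S", cat="S")
g.add_node("NP", cat="NP")
g.add_node("VP", cat="VP")
g.add_node("t1", tok="the")
g.add_node("t2", tok="dog")
g.add_node("t3", tok="barks")
for src, tgt in [("S", "NP"), ("S", "VP"), ("NP", "t1"), ("NP", "t2"), ("VP", "t3")]:
    g.add_edge("dominance", src, tgt)

# Indirect dominance: every token below the NP node.
np_tokens = find_pairs(g, "dominance",
                       lambda l: l.get("cat") == "NP",
                       lambda l: "tok" in l)

# Direct dominance only: children of S.
s_children = g.reachable("dominance", "S", max_dist=1)
```

In this sketch a binary query operator reduces to one reachability call per candidate left-hand node, and the difference between direct and indirect dominance is just the distance bound passed to the search.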
