Krill: KorAP search and analysis engine

J. Lang. Technol. Comput. Linguistics. Pub Date: 2016-07-01. DOI: 10.21248/jlcl.31.2016.202

Nils Diewald, Eliza Margaretha



Abstract

KorAP (Korpusanalyseplattform) is a corpus search and analysis platform for handling very large corpora with multiple annotation layers, multiple query languages, and complex licensing models (Bański et al., 2013a). It is intended to succeed the COSMAS II system (Bodmer, 1996) in providing access to DEREKO, the German reference corpus (Kupietz and Lüngen, 2014), hosted by the Institute for the German Language (IDS). The corpus consists of a wide range of texts such as fiction, newspaper articles and scripted speech, annotated on multiple linguistic levels, for instance with part-of-speech and syntactic dependency structures. It was reported to contain approximately 30 billion words in September 2016 and continues to grow. Krill (Corpus-data Retrieval Index using Lucene for Look-ups) is a corpus search engine that serves as a search component in KorAP. It is based on Apache Lucene, a popular and well-established information retrieval engine. Lucene’s lightweight memory requirements and scalable indexing make it suitable for handling large, rapidly growing corpora. Lucene supports full-text search for many query types, including phrase and wildcard queries, and allows custom implementations to cope with complex linguistic queries. In this paper, we describe Krill and how its index is designed to handle full-text and complex annotation search that combines different annotation layers and sources in very large corpora. The paper is structured as follows. Section 2 describes how a search works in KorAP, from receiving a search request to returning the search results. Section 3 explains how corpus data are represented and indexed in Krill. Section 4 describes the kinds of queries handled by Krill and how they are processed for the actual search on the index. The Krill response format containing search results is described in Section 5. We present related and further work in Sections 6 and 7, respectively. The paper ends with a summary.


