Krill: KorAP search and analysis engine

Nils Diewald, Eliza Margaretha
Journal: J. Lang. Technol. Comput. Linguistics. Published: 2016-07-01. DOI: 10.21248/jlcl.31.2016.202 (https://doi.org/10.21248/jlcl.31.2016.202)

Abstract

KorAP (Korpusanalyseplattform) is a corpus search and analysis platform for handling very large corpora with multiple annotation layers, multiple query languages, and complex licensing models (Bański et al., 2013a). It is intended to succeed the COSMAS II system (Bodmer, 1996) in providing access to DEREKO, the German reference corpus (Kupietz and Lüngen, 2014), hosted by the Institute for the German Language (IDS). The corpus consists of a wide range of texts such as fiction, newspaper articles and scripted speech, annotated on multiple linguistic levels, for instance with part-of-speech tags and syntactic dependency structures. It was reported to contain approximately 30 billion words in September 2016 and continues to grow. Krill (Corpus-data Retrieval Index using Lucene for Look-ups) is a corpus search engine that serves as a search component in KorAP. It is based on Apache Lucene, a popular and well-established information retrieval engine. Lucene's low memory requirements and scalable indexing make it suitable for handling large corpora whose size increases rapidly. Lucene supports full-text search for many query types, including phrase and wildcard queries, and allows custom implementations to cope with complex linguistic queries. In this paper, we describe Krill and how its index is designed to handle full-text search and complex annotation search that combines different annotation layers and sources in very large corpora.

The paper is structured as follows. Section 2 describes how a search works in KorAP, from receiving a search request to returning the search results. Section 3 explains how corpus data are represented and indexed in Krill. Section 4 describes the various kinds of queries handled by Krill and how they are processed for the actual search on the index. Section 5 describes the Krill response format, which contains the search results. We present related work in Section 6 and further work in Section 7. The paper ends with a summary.
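To make the idea of searching across annotation layers more concrete, the following is a minimal Python sketch of a toy positional index in which surface words and annotations share token positions. It is an illustration only, not Krill's or Lucene's actual index design: the class `ToyPositionalIndex`, the `layer:value` term naming, and the layer names `word` and `pos` are all assumptions made for this example.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


class ToyPositionalIndex:
    """Toy in-memory positional index (hypothetical, not Krill's real structure)."""

    def __init__(self):
        # term -> doc_id -> list of token positions where the term occurs
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, tokens):
        """Index a document given as a list of {layer: value} dicts, one per token."""
        for pos, layers in enumerate(tokens):
            for layer, value in layers.items():
                # Assumed term naming: "<layer>:<value>", e.g. "pos:NN".
                self.postings[f"{layer}:{value}"][doc_id].append(pos)

    def positions(self, pattern):
        """Return doc_id -> set of positions of all terms matching a wildcard pattern."""
        hits = defaultdict(set)
        for term, docs in self.postings.items():
            if fnmatchcase(term, pattern):
                for doc_id, term_positions in docs.items():
                    hits[doc_id].update(term_positions)
        return hits

    def sequence(self, patterns):
        """Return (doc_id, start) pairs where the patterns match consecutive tokens."""
        per_step = [self.positions(p) for p in patterns]
        results = []
        for doc_id, starts in per_step[0].items():
            for start in sorted(starts):
                if all(start + i in step.get(doc_id, ())
                       for i, step in enumerate(per_step)):
                    results.append((doc_id, start))
        return sorted(results)


index = ToyPositionalIndex()
index.add("d1", [
    {"word": "der", "pos": "ART"},
    {"word": "alte", "pos": "ADJA"},
    {"word": "Baum", "pos": "NN"},
])
index.add("d2", [
    {"word": "ein", "pos": "ART"},
    {"word": "Baum", "pos": "NN"},
])

# An adjective (wildcard on the part-of-speech layer) followed by a word starting with "B".
matches = index.sequence(["pos:ADJ*", "word:B*"])
print(matches)
```

Because every layer is keyed by the same token positions, a single sequence query can mix a wildcard on one layer with a surface-form constraint on another. A production engine would store such postings on disk and avoid scanning all terms for wildcard matching; this sketch ignores both concerns.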