
Latest publications in J. Lang. Technol. Comput. Linguistics

Krill: KorAP search and analysis engine
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.202
Nils Diewald, Eliza Margaretha
KorAP (Korpusanalyseplattform) is a corpus search and analysis platform for handling very large corpora with multiple annotation layers, multiple query languages, and complex licensing models (Bański et al., 2013a). It is intended to succeed the COSMAS II system (Bodmer, 1996) in providing DeReKo, the German reference corpus (Kupietz and Lüngen, 2014), hosted by the Institute for the German Language (IDS). The corpus consists of a wide range of texts such as fiction, newspaper articles and scripted speech, annotated on multiple linguistic levels, for instance part-of-speech and syntactic dependency structures. It was reported to contain approximately 30 billion words in September 2016 and continues to grow. Krill (Corpus-data Retrieval Index using Lucene for Look-ups) is a corpus search engine that serves as the search component of KorAP. It is based on Apache Lucene, a popular and well-established information retrieval engine. Lucene's lightweight memory requirements and scalable indexing are suitable for handling large corpora whose size increases rapidly. It supports full-text search for many query types, including phrase and wildcard queries, and allows custom implementations to cope with complex linguistic queries. In this paper, we describe Krill and how its index is designed to handle full-text and complex annotation search, combining different annotation layers and sources of very large corpora. The paper is structured as follows. Section 2 describes how a search works in KorAP, from receiving a search request to returning the search results. Section 3 explains how corpus data are represented and indexed in Krill. Section 4 describes the various kinds of queries handled by Krill and how they are processed for the actual search on the index. The Krill response format containing search results is described in Section 5. We present related and further work in Sections 6 and 7, respectively. The paper ends with a summary.
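How an index can combine annotation layers is easier to see in miniature. The sketch below builds a toy positional inverted index in which every layer (surface form, part-of-speech tag, lemma) posts terms at the same token positions, so a layer-mixing query reduces to a position intersection. All names and prefixes are invented for illustration; Krill's real index is implemented on top of Apache Lucene's data structures and is far more elaborate.

```python
from collections import defaultdict

# Toy positional inverted index: each annotation layer contributes terms
# posted at the same token positions, so a query that combines layers
# (e.g. lemma + part-of-speech) becomes a position intersection.
index = defaultdict(set)  # term -> set of (doc_id, position)

def add_token(doc_id, pos, surface, pos_tag, lemma):
    """Post one token to the index under all of its annotation layers."""
    index[f"s:{surface}"].add((doc_id, pos))
    index[f"p:{pos_tag}"].add((doc_id, pos))
    index[f"l:{lemma}"].add((doc_id, pos))

def query(*terms):
    """Find positions where all given layer terms co-occur."""
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

add_token("doc1", 0, "laufen", "VVINF", "laufen")
add_token("doc1", 1, "Hunde", "NN", "Hund")

print(query("l:Hund", "p:NN"))  # {('doc1', 1)}
```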
Citations: 9
Automatisierter Abgleich des Lautstandes althochdeutscher Wortformen (Automated comparison of the phonological state of Old High German word forms)
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.209
Roland Mittmann
In order to automatically examine texts of a language for their possible period of origin and their dialectal affiliation, correspondence rules are first recorded, for each expected grapheme and each inflectional ending, between an idealized form of the language and the language forms of the time-dialect areas described in grammars. A computer program then applies these rules to compare the attested word forms with their counterparts in the idealized language form, and for each text the degrees of agreement with the individual time-dialect areas are reported. This comparison is described for an Old High German word form as an example, and the result of the analysis of the associated complete text is presented. 1 Research topic Languages have always changed in the course of time. Moreover, as soon as their speaker communities break up into separate groups that are no longer in permanent contact with one another, they develop distinct varieties. As long as a language has not been standardized, its textual transmission therefore remains linguistically inconsistent. Variation can also occur within a single text, for instance when speakers of different dialects work on the same text or correct an existing text (cf. BRAUNE/REIFFENSTEIN 2004, § 3 and note 1). A single author may likewise be subject to different dialectal influences or may reflect in his writings the linguistic change that occurred during his lifetime. Finally, since texts were reproduced solely by copying before the invention of printing, copyists too, consciously or unconsciously, made linguistic adaptations of dialectal forms or of forms affected by change over time. If no more precise temporal and local information is known for part of the textual transmission of a language, it appears conceivable to examine it automatically for its agreement with different time-dialect areas, that is, periods of time related to the various local varieties. This investigation is described below. It presupposes that information is available on the usual phoneme-grapheme correspondences (spellings of sounds, cf. MITTMANN 2015b, 248) and on the inflectional endings in the individual time-dialect areas.
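The comparison procedure lends itself to a compact illustration. The following sketch applies per-area grapheme correspondence rules to idealized forms and scores a text by the share of attested forms that match the expected realization; the rules, area names, and word forms are invented placeholders, not data from the paper.

```python
# Per time-dialect area: grapheme correspondence rules mapping an
# idealized form to its expected realization in that area. The rules
# and forms below are illustrative placeholders.
RULES = {
    "Bavarian-900": [("uo", "ua"), ("ei", "ai")],
    "Franconian-900": [("uo", "uo"), ("ei", "ei")],
}

def expected_form(ideal, area):
    """Apply an area's correspondence rules to an idealized form."""
    form = ideal
    for src, dst in RULES[area]:
        form = form.replace(src, dst)
    return form

def agreement(pairs, area):
    """Share of (idealized, attested) pairs matching the area's rules."""
    hits = sum(expected_form(ideal, area) == attested
               for ideal, attested in pairs)
    return hits / len(pairs)

text = [("guot", "guat"), ("stein", "stain"), ("liebe", "liebe")]
for area in RULES:
    print(area, round(agreement(text, area), 2))
# Bavarian-900 scores 1.0 here, Franconian-900 only 0.33.
```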
Citations: 1
Gepi: An Epigraphic Corpus for Old Georgian and a Tool Sketch for Aiding Reconstruction
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.210
Armin Hoenen, Lela Samushia
In this paper, an annotated corpus of Old Georgian inscriptions is introduced. The corpus contains 91 inscriptions, annotated in EpiDoc, the standard epigraphic XML format and a part of the TEI. In addition, a prototype tool for aiding epigraphic reconstruction is designed around the inherent needs of epigraphy. The prototype backend uses word embeddings and frequencies generated from a corpus of Old Georgian to determine possible gap fillers. The method is applied to the gaps in the corpus and generates promising results. A sketch of a front end is being designed.
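The gap-filling backend can be pictured as scoring candidate fillers by their similarity to the surrounding context, weighted by corpus frequency. The sketch below does this with toy three-dimensional vectors and invented counts; the actual tool derives embeddings and frequencies from an Old Georgian corpus, and English stand-ins are used here purely for readability.

```python
import numpy as np

# Toy embeddings and frequency counts; in the tool these come from a
# corpus of Old Georgian rather than being hand-written.
EMB = {
    "king": np.array([0.9, 0.1, 0.0]),
    "built": np.array([0.1, 0.9, 0.2]),
    "church": np.array([0.8, 0.3, 0.1]),
    "fish": np.array([0.0, 0.2, 0.9]),
}
FREQ = {"church": 40, "fish": 5}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_fillers(context, candidates):
    """Score candidates by context similarity weighted by frequency."""
    ctx = np.mean([EMB[w] for w in context], axis=0)
    scored = [(cos(EMB[c], ctx) * FREQ[c], c) for c in candidates]
    return sorted(scored, reverse=True)

# "the king built the [---]" with a gap to reconstruct
print(rank_fillers(["king", "built"], ["church", "fish"]))
```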
Citations: 1
ReM: A reference corpus of Middle High German - corpus compilation, annotation, and access
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.208
Florian Petran, Marcel Bollmann, Stefanie Dipper, Thomas Klein
This paper describes ReM and the results of the ReM project and its predecessors. All of these projects collaborate closely in developing common annotation standards to allow for diachronic investigations. ReA has already been published and made available via the corpus search tool ANNIS2 (Krause and Zeldes, 2016), while ReF and ReN are still in the annotation process. The ReM project builds on several earlier annotation efforts, such as the corpus of the new Middle High German Grammar (MiGraKo, Klein et al. (2009)), expanding them and adding further texts to produce a reference corpus for Middle High German, which we will also call "ReM" for short. The combined corpus, which consists of around two million tokens, provides a mostly complete collection of written records from Early Middle High German (1050–1200) as well as a selection of Middle High German texts from 1200 to 1350. Texts have been digitized and annotated with parts of speech and morphology (using the HiTS tagset, cf. Dipper et al. (2013)) as well as lemma information. Release 1.0 of ReM was published in December 2016 and is also accessible via the ANNIS tool. The project website at https://www.linguistics.ruhr-uni-bochum.de/rem/ offers extensive documentation of the project and the corpus.
Citations: 5
Merging and validating heterogeneous, multi-layered corpora with discoursegraphs
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.204
Arne Neumann
We present discoursegraphs, a Python library and command-line application for converting and merging linguistic annotations. The software reads and writes numerous formats for syntactic and discourse-related annotations, but also supports generic interchange formats. discoursegraphs models primary data and its annotations as a graph and is therefore able to merge multiple independent, possibly conflicting annotation layers into a unified representation. We show how this approach benefits the revision and validation of a corpus with multiple conflicting, independently annotated layers.
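The graph-merging idea can be shown in a few lines. In the sketch below, two annotation layers over the same token IDs are merged by graph union, with the layer recorded as an edge attribute so that independent or conflicting analyses can coexist and be inspected; this uses networkx purely for illustration and is not the actual discoursegraphs API.

```python
import networkx as nx

# Two independent annotation layers over the same token IDs.
syntax = nx.MultiDiGraph()
syntax.add_edge("tok1", "tok2", layer="syntax", rel="det")

coref = nx.MultiDiGraph()
coref.add_edge("tok2", "tok7", layer="coreference", rel="anaphoric")

# Merging is a graph union over the shared token nodes; the layer
# attribute keeps the origin of every edge visible after the merge.
merged = nx.compose(syntax, coref)
for u, v, data in merged.edges(data=True):
    print(u, v, data)
```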
Citations: 1
SpoCo - a simple and adaptable web interface for dialect corpora
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.206
R. Waldenfels, Michal Wozniak
We present SpoCo, a simple yet effective system for web-based querying of dialect corpora encoded in ELAN, which provides users with advanced concordancing functions as well as the possibility to edit and correct transcriptions where needed. SpoCo is easy to use and maintain, and can be adapted to different spoken corpora in a straightforward way. Simplicity is emphasized to facilitate use by a wide range of users and research groups, including those with limited technical and financial resources, and to encourage collaboration and data exchange across such groups. Relying on existing technology and pursuing a modular architecture, SpoCo is developed bottom-up: it was initially devised for a specific dialect project and is continually being adapted for use in other projects in a network of Slavic dialect projects that cooperate in tool development and data sharing. SpoCo thus takes a middle position between systems developed for the purposes of a specific dialect corpus, on the one hand, and general-use systems designed for a wide range of data and usage cases, on the other.
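Since ELAN's .eaf files are plain XML, a minimal concordance lookup over their tiers is easy to sketch, as below. The file name and the flat substring search are illustrative only and say nothing about SpoCo's actual backend; the element names (TIER, ANNOTATION_VALUE) follow ELAN's documented annotation format.

```python
import xml.etree.ElementTree as ET

def search_eaf(path, needle):
    """Yield (tier ID, annotation value) for every hit in one .eaf file."""
    tree = ET.parse(path)
    for tier in tree.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for value in tier.iter("ANNOTATION_VALUE"):
            if value.text and needle in value.text:
                yield tier_id, value.text

# "recording_01.eaf" is a hypothetical file name used for illustration.
for tier, hit in search_eaf("recording_01.eaf", "haben"):
    print(tier, "->", hit)
```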
Citations: 4
graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.199
Thomas Krause, U. Leser, Anke Lüdeling
We present graphANNIS, a fast implementation of the established query language AQL for dealing with deeply annotated linguistic corpora. AQL builds on a graph-based abstraction for modeling and exchanging linguistic data, yet all of its current implementations use relational databases as the storage layer. In contrast, graphANNIS implements the ANNIS graph data model directly in main memory. We show that the vast majority of the AQL functionality can be mapped to the basic operation of finding paths in a graph, and we present efficient implementations and index structures for this and all other required operations. We compare the performance of graphANNIS with that of the standard SQL-based implementation of AQL, using a workload of more than 3,000 real-life queries on a set of 17 open corpora, each with a size of up to 3 million tokens, whose annotations range from simple, linear part-of-speech tagging to deeply nested discourse structures. For the entire workload, graphANNIS is more than 40 times faster, and slower in less than 3% of the queries. graphANNIS, as well as the workload and corpora used for evaluation, is freely available on GitHub and in the Zenodo Open Access archive.
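The central reduction, mapping an AQL operator to path finding, can be illustrated directly. In the sketch below, indirect dominance (AQL's >* operator) becomes a breadth-first reachability check over a toy syntax tree; the adjacency list is a stand-in and does not resemble graphANNIS's actual storage or index structures.

```python
from collections import deque

# Toy annotation graph: dominance edges of a small syntax tree.
dominance = {
    "S": ["NP", "VP"],
    "NP": ["tok1"],
    "VP": ["tok2", "tok3"],
}

def dominates(src, dst):
    """True if there is a dominance path src ->* dst (BFS reachability)."""
    queue, seen = deque([src]), {src}
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for child in dominance.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False

print(dominates("S", "tok3"))   # True:  S >* tok3
print(dominates("NP", "tok2"))  # False: tok2 is not under NP
```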
Citations: 10
PAL, a tool for Pre-annotation and Active Learning
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.203
Maria Skeppstedt, C. Paradis, A. Kerren
Many natural language processing systems rely on machine learning models that are trained on large amounts of manually annotated text data. The lack of sufficient amounts of annotated data is, however, a common obstacle for such systems, since manual annotation of text is often expensive and time-consuming. The aim of PAL, a tool for pre-annotation and active learning, is to provide a ready-made package that can be used to simplify annotation and to reduce the amount of annotated data required to train a machine learning classifier. The package provides support for two techniques that have been shown to be successful in previous studies, namely active learning and pre-annotation. The output of the pre-annotation is provided in the annotation format of the annotation tool BRAT, but PAL is a stand-alone package that can be adapted to other formats.
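Active learning by uncertainty sampling, one of the two techniques PAL supports, fits in a short loop: train on the labelled seed set, then rank the unlabelled examples by how unsure the model is, so the annotator labels the most informative ones first. The data and the scikit-learn classifier below are illustrative stand-ins, not PAL's internals.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

labelled = ["great tool", "awful bug", "nice result", "broken build"]
labels = [1, 0, 1, 0]
unlabelled = ["quite nice", "horrible crash", "somewhat okay build"]

# Train an initial classifier on the small labelled seed set.
vec = CountVectorizer()
X = vec.fit_transform(labelled)
clf = LogisticRegression().fit(X, labels)

# Least-confident sampling: the lower the top class probability,
# the more informative the example is for the annotator to label next.
probs = clf.predict_proba(vec.transform(unlabelled))
uncertainty = 1 - probs.max(axis=1)
order = np.argsort(-uncertainty)  # most uncertain first
for i in order:
    print(f"{uncertainty[i]:.2f}  {unlabelled[i]}")
```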
Citations: 17
Data Mining Software for Corpus Linguistics with Application in Diachronic Linguistics
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.201
Christian Pölitz
Large digital corpora have become a valuable resource for linguistic research. We introduce a software tool to efficiently perform data mining tasks for diachronic linguistics, investigating linguistic phenomena with respect to time. As a running example, we show a topic model that extracts different meanings from large digital corpora over time.
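The running example can be approximated by fitting a topic model per time slice and comparing the top terms across slices, as in the hedged sketch below; the two-sentence "corpora" are toy stand-ins for the large digital corpora the tool is built for, chosen so that the shift in the contexts of "mouse" is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy time slices; each would be a large sub-corpus in practice.
slices = {
    "1900s": ["the mouse ate the cheese", "a mouse ran across the field"],
    "2000s": ["click the mouse button", "the mouse cursor froze again"],
}

for period, docs in slices.items():
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    # One topic per slice is enough to surface the dominant context.
    lda = LatentDirichletAllocation(n_components=1, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    top = lda.components_[0].argsort()[::-1][:3]
    print(period, [terms[i] for i in top])
```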
Citations: 0
Construction and Dissemination of a Corpus of Spoken Interaction - Tools and Workflows in the FOLK project
Pub Date: 2016-07-01 DOI: 10.21248/jlcl.31.2016.205
Thomas C. Schmidt
This paper is about the workflow for the construction and dissemination of FOLK (Forschungs- und Lehrkorpus Gesprochenes Deutsch, the Research and Teaching Corpus of Spoken German), a large corpus of authentic spoken interaction data recorded on audio and video. Section 2 describes in detail the tools used in the individual steps of transcription, anonymization, orthographic normalization, lemmatization and POS tagging of the data, as well as some utilities used for corpus management. Section 3 deals with the DGD (Datenbank für Gesprochenes Deutsch, the Database of Spoken German) as a tool for distributing completed data sets and making them available for qualitative and quantitative analysis. In section 4, some plans for further development are sketched.
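The workflow described above is, at its core, a chain of transcript-to-transcript steps, which the toy pipeline below makes explicit; the anonymization and normalization rules are invented placeholders, as FOLK relies on dedicated tools for each of these steps.

```python
import re

def anonymize(text, names=("Meier",)):
    """Placeholder anonymization: mask known person names."""
    for name in names:
        text = text.replace(name, "[NAME]")
    return text

def normalize(text):
    """Placeholder orthographic normalization of one colloquial form."""
    return re.sub(r"\bhab\b", "habe", text)

def tokenize(text):
    """Split into tokens, keeping the [NAME] mask as one token."""
    return re.findall(r"\[NAME\]|\w+", text)

# The workflow is a simple composition of transcript-to-transcript steps.
PIPELINE = [anonymize, normalize]

utterance = "ich hab Meier gesehen"
for step in PIPELINE:
    utterance = step(utterance)
print(tokenize(utterance))  # ['ich', 'habe', '[NAME]', 'gesehen']
```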
Citations: 11