The structuralist tradition meets empirical data: Corpus data enhancing the Czech Internet Language Reference Book

Word Structure (Language & Linguistics), IF 1.1. Pub Date: 2023-11-01. DOI: 10.3366/word.2023.0230

Dominika Kováříková, Martin Beneš, Kamila Smejkalová, Oleg Kovářík



Abstract

This paper demonstrates how the corpus grammar tool GramatiKat can be used to improve and refine morphological information in the Internet Language Reference Book (ILRB), which presents complete declension paradigms for 45,632 standard Czech nouns. The paradigm tables are based mainly on morphological types, following structuralist conceptions of language as a fully articulated system. The paper discusses how to update the ILRB and provide users with empirically based grammatical information for individual word forms in each cell of the paradigm. All noun lemmas have been investigated using the GramatiKat tool for research into grammatical categories in Czech. The tool observes the distribution of word forms of a particular lexeme in comparison with the standard distribution across the whole word class. It is capable of identifying nouns that have an unusually high occurrence of a certain word form, as well as nouns with unattested word forms. GramatiKat draws on data from two corpora of written Czech, SYN2015 and SYN2020 (200 million word tokens). The paper investigates the relationship between defectiveness and overabundance on the one hand and language variability and potentiality on the other. Based on the unique combination of data from the ILRB and GramatiKat, the paper suggests how unusually frequent or overabundant word forms, as well as unattested ones, should be flagged, so that the ILRB provides the user with accurate, empirically based data.
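The comparison described in the abstract, one lexeme's word-form distribution set against the distribution for the whole word class, can be illustrated with a minimal Python sketch. It is not GramatiKat's actual method: the function names, the `ratio` threshold, and the simple ratio test are all assumptions made for illustration, and any counts passed in would be the reader's own. The only linguistic fact relied on is that a Czech noun paradigm has 14 cells (7 cases in 2 numbers).

```python
# The 14 paradigm cells of a Czech noun: 7 cases x 2 numbers.
CASES = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]
CELLS = [f"{case}.{number}" for number in ("sg", "pl") for case in CASES]


def proportions(counts):
    """Turn raw per-cell counts into relative frequencies over all 14 cells."""
    total = sum(counts.get(cell, 0) for cell in CELLS)
    if total == 0:
        return {cell: 0.0 for cell in CELLS}
    return {cell: counts.get(cell, 0) / total for cell in CELLS}


def profile(lemma_counts, class_counts, ratio=3.0):
    """Compare one lemma's cell distribution with the word-class baseline.

    `ratio` is a hypothetical threshold, not a value taken from the paper:
    a cell is reported as overrepresented when its share for the lemma is
    at least `ratio` times its share across the whole word class.
    """
    lemma_p = proportions(lemma_counts)
    class_p = proportions(class_counts)
    # Cells with no corpus hits for this lemma (candidates for defectiveness).
    unattested = [c for c in CELLS if lemma_counts.get(c, 0) == 0]
    # Cells that are unusually frequent relative to the class baseline.
    overrepresented = [
        c for c in CELLS
        if class_p[c] > 0 and lemma_p[c] / class_p[c] >= ratio
    ]
    return {"unattested": unattested, "overrepresented": overrepresented}
```

Calling `profile(lemma_counts, class_counts)` with two dictionaries keyed by cell labels such as `"loc.sg"` returns the cells with no corpus hits and the cells that exceed the threshold. A real analysis would likely need a statistically grounded criterion and a minimum-frequency cutoff, since a rare lemma will have unattested cells by chance alone.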


