The structuralist tradition meets empirical data: Corpus data enhancing the Czech Internet Language Reference Book

Word Structure · Language & Linguistics · IF 0.7 · JCR: N/A · Pub Date: 2023-11-01 · DOI: 10.3366/word.2023.0230
Dominika Kováříková, Martin Beneš, Kamila Smejkalová, Oleg Kovářík

Abstract

This paper demonstrates how the corpus grammar tool GramatiKat can be used to improve and refine morphological information in the Internet Language Reference Book (ILRB), which presents complete declension paradigms for 45,632 standard Czech nouns. The paradigm tables are based mainly on morphological types, following structuralist conceptions of language as a fully articulated system. The paper discusses how to update the ILRB and provide users with empirically based grammatical information for individual word forms in each cell of the paradigm. All noun lemmas have been investigated using GramatiKat, a tool for research into grammatical categories in Czech. The tool compares the distribution of word forms of a particular lexeme with the standard distribution across the whole word class. It can identify nouns with an unusually high occurrence of a certain word form, as well as nouns with unattested word forms. GramatiKat is based on data from two corpora of written Czech, SYN2015 and SYN2020 (200 million word tokens). The paper investigates the relationship between defectiveness and overabundance on the one hand and language variability and potentiality on the other. Based on the unique combination of data from the ILRB and GramatiKat, the paper suggests how information about unusually frequent or overabundant word forms, as well as unattested ones, should be flagged, so that the ILRB provides the user with accurate, empirically based data.
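
The comparison described in the abstract, a lexeme's word-form distribution set against the distribution for the whole word class, can be illustrated with a small Python sketch. This is not GramatiKat's actual method, which the abstract does not specify. The cell labels, the function names `proportions` and `flag_cells`, the ratio threshold of 3.0, and all counts below are assumptions invented for illustration.

```python
from typing import Dict

# Paradigm cells: 7 Czech cases x 2 numbers (labels are an assumption).
CASES = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]
CELLS = [f"{num}.{case}" for num in ("sg", "pl") for case in CASES]


def proportions(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw per-cell counts into relative frequencies over all cells."""
    total = sum(counts.get(cell, 0) for cell in CELLS)
    if total == 0:
        raise ValueError("no attested forms")
    return {cell: counts.get(cell, 0) / total for cell in CELLS}


def flag_cells(
    lemma_counts: Dict[str, int],
    reference: Dict[str, float],
    ratio: float = 3.0,  # hypothetical threshold, not taken from the paper
) -> Dict[str, str]:
    """Label each cell 'unattested', 'unusually frequent', or 'typical'."""
    props = proportions(lemma_counts)
    labels = {}
    for cell in CELLS:
        if lemma_counts.get(cell, 0) == 0:
            labels[cell] = "unattested"
        elif reference[cell] > 0 and props[cell] / reference[cell] >= ratio:
            labels[cell] = "unusually frequent"
        else:
            labels[cell] = "typical"
    return labels


# Invented counts standing in for the whole noun class (not corpus data).
reference_counts = {
    "sg.nom": 200, "sg.gen": 220, "sg.dat": 30, "sg.acc": 150,
    "sg.voc": 5, "sg.loc": 80, "sg.ins": 60,
    "pl.nom": 60, "pl.gen": 90, "pl.dat": 10, "pl.acc": 50,
    "pl.voc": 1, "pl.loc": 24, "pl.ins": 20,
}
reference = proportions(reference_counts)

# Invented counts for one hypothetical noun used mostly in the locative.
lemma_counts = {
    "sg.nom": 10, "sg.gen": 20, "sg.acc": 10,
    "sg.loc": 50, "sg.ins": 5, "pl.gen": 5,
}

labels = flag_cells(lemma_counts, reference)
for cell in CELLS:
    print(f"{cell}: {labels[cell]}")
```

In this toy example the locative singular is labelled unusually frequent and the eight cells with no occurrences are labelled unattested. A zero count in a finite corpus does not by itself show that a form is impossible, which is why the sketch says "unattested" and not "defective".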