Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Journal: Data (IF 2; Q3, Computer Science, Information Systems) | Pub Date: 2023-11-10 | DOI: 10.3390/data8110170

Sascha Wolfer, Alexander Koplenig, Marc Kupietz, Carolin Müller-Spitzer


Abstract

We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
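The vocabulary-growth case study described in the abstract can be illustrated with a minimal sketch. It assumes each fold is already available as a word-frequency table; the toy folds, the function name `vocabulary_growth`, and the in-memory `Counter` representation are all hypothetical and do not reflect the actual DeReKoGram file format or the authors' scripts.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def vocabulary_growth(folds: Iterable[Counter]) -> List[Tuple[int, int, int]]:
    """Return (folds included, vocabulary size, hapax count) after adding each fold."""
    cumulative: Counter = Counter()
    growth = []
    for n_folds, fold in enumerate(folds, start=1):
        # Counter.update adds the fold's frequencies to the running totals.
        cumulative.update(fold)
        vocab_size = len(cumulative)
        # A hapax legomenon is a type whose cumulative frequency is exactly 1.
        hapaxes = sum(1 for freq in cumulative.values() if freq == 1)
        growth.append((n_folds, vocab_size, hapaxes))
    return growth


# Hypothetical toy data standing in for three corpus folds (assumption, not real counts).
toy_folds = [
    Counter({"der": 3, "Hund": 1, "bellt": 1}),
    Counter({"der": 2, "Hund": 1, "schläft": 1}),
    Counter({"die": 1, "Katze": 1, "schläft": 1}),
]

for n_folds, vocab_size, hapaxes in vocabulary_growth(toy_folds):
    print(n_folds, vocab_size, hapaxes)
```

Note that a word counted as a hapax in an early fold (here "Hund") stops being one as soon as a later fold contains it again, which is why the hapax count has to be recomputed on the cumulative frequencies rather than summed per fold.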


