Building Linguistic Corpora from Wikipedia Articles and Discussions

J. Lang. Technol. Comput. Linguistics | Pub Date: 2014-07-01 | DOI: 10.21248/jlcl.29.2014.189

Eliza Margaretha, H. Lüngen

Citations: 36

Abstract

Wikipedia is a valuable resource, useful as a linguistic corpus or as a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus, DeReKo). Our approach is a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets. The conversion approach generates rich and valid corpora regardless of language. We also introduce a method to segment user contributions in talk pages into postings.
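
The abstract does not spell out how talk pages are segmented into postings, so the following is only an illustrative sketch and not the authors' method. It assumes English-Wikipedia conventions: a posting ends at a signature timestamp such as "12:34, 5 June 2013 (UTC)", the author is the last user link before that timestamp, and leading colons mark the reply depth. The names `Posting`, `segment_postings`, `TIMESTAMP`, and `USER_LINK` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: English-Wikipedia signature timestamp, e.g. "12:34, 5 June 2013 (UTC)".
TIMESTAMP = re.compile(r"\d{2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\)")
# Assumption: signatures link to a user or user talk page, e.g. "[[User:Alice|Alice]]".
USER_LINK = re.compile(r"\[\[User(?: talk)?:([^|\]#/]+)")


@dataclass
class Posting:
    author: Optional[str]     # None if no user link was found
    timestamp: Optional[str]  # None for trailing unsigned text
    level: int                # reply depth, taken from leading colons
    text: str                 # raw wikitext of the posting


def _make_posting(lines: List[str], timestamp: Optional[str]) -> Posting:
    block = "\n".join(lines)
    users = USER_LINK.findall(block)
    first = lines[0]
    level = len(first) - len(first.lstrip(":"))
    author = users[-1].strip() if users else None
    return Posting(author=author, timestamp=timestamp, level=level, text=block)


def segment_postings(wikitext: str) -> List[Posting]:
    """Heuristically split talk-page wikitext into postings."""
    postings: List[Posting] = []
    buffer: List[str] = []
    for line in wikitext.splitlines():
        # Skip blank lines and section headings such as "== Topic ==".
        if not line.strip() or line.startswith("=="):
            continue
        buffer.append(line)
        stamp = TIMESTAMP.search(line)
        if stamp:
            # A signature timestamp closes the current posting.
            postings.append(_make_posting(buffer, stamp.group(0)))
            buffer = []
    if buffer:
        # Trailing text without a signature is kept as an unsigned posting.
        postings.append(_make_posting(buffer, None))
    return postings


sample = (
    "== Topic ==\n"
    "I think this is wrong. [[User:Alice|Alice]] 12:34, 5 June 2013 (UTC)\n"
    ":I disagree. [[User:Bob|Bob]] ([[User talk:Bob|talk]]) 13:00, 5 June 2013 (UTC)\n"
    "::Left without a signature\n"
)
for posting in segment_postings(sample):
    print(posting.level, posting.author, posting.timestamp)
```

A heuristic of this kind is language-specific (month names, the "User" namespace label, and the time zone suffix all vary between Wikipedias), so a language-independent pipeline like the one the abstract describes would need a more general rule set.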
