Mining historical texts for diachronic spelling variants

Poznan Studies in Contemporary Linguistics (IF 0.6; Zone 4, Literature; Language & Linguistics) Pub Date: 2020-12-01 DOI: 10.1515/psicl-2020-0021

F. Gralinski, K. Jassem

Citations: 1

Abstract

The paper describes a method for finding diachronic spelling variants in a corpus of historical and modern Polish texts. The procedure combines the Levenshtein distance with a similarity measure derived from a Word2vec model. The method was applied to both words and sub-word units. A sample of the resulting spelling variants was manually evaluated and compared against an existing morphological analyser for Polish historical texts. The resulting lists of spelling variants and spelling modernisation rules were used in a text modernisation tool, and their contribution was evaluated. The paper also presents an analogous method for finding spelling variants that result from OCR errors. The resulting lists of OCR variants and rules may be used to correct OCR output.
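The idea of pairing an edit-distance filter with an embedding-similarity filter can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the thresholds (max_edits, min_sim) and the tiny hand-made vectors are all hypothetical, and the vectors stand in for embeddings that would in practice come from a Word2vec model trained on the corpus. The word forms are illustrative and are not taken from the paper.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def variant_candidates(vectors, max_edits=2, min_sim=0.8):
    """Return (word1, word2, distance) for pairs that are close in spelling
    AND close in embedding space. Both thresholds are hypothetical."""
    words = sorted(vectors)
    pairs = []
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            d = levenshtein(w1, w2)
            if 0 < d <= max_edits and cosine(vectors[w1], vectors[w2]) >= min_sim:
                pairs.append((w1, w2, d))
    return pairs


# Hand-made toy vectors (an assumption for illustration, not trained embeddings).
toy_vectors = {
    "jest": [0.90, 0.10, 0.20],
    "iest": [0.88, 0.12, 0.21],  # similar spelling, similar vector
    "test": [0.10, 0.90, 0.30],  # similar spelling, dissimilar vector
}

if __name__ == "__main__":
    for w1, w2, d in variant_candidates(toy_vectors):
        print(w1, w2, d)
```

In this toy setup the pair "iest"/"jest" passes both filters, while "test" is rejected by the similarity filter despite being one edit away from both. The all-pairs loop is quadratic in vocabulary size, so a real corpus would need some form of candidate pruning.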

