Mining historical texts for diachronic spelling variants

IF 0.5 | Zone 4, Literature | JCR: N/A | LANGUAGE & LINGUISTICS | Poznan Studies in Contemporary Linguistics | Pub Date: 2020-12-01 | DOI: 10.1515/psicl-2020-0021
F. Gralinski, K. Jassem
Citations: 1

Abstract

The paper describes a method for finding diachronic spelling variants in a corpus of historical and modern Polish texts. The procedure applies the Levenshtein distance and a similarity measure determined with a Word2vec model. The method was applied to both words and sub-word units. A sample of spelling variants was manually evaluated and compared against an existing morphological analyser for Polish historical texts. The resulting lists of spelling variants and spelling modernisation rules were used in a text modernisation tool, and their contribution was evaluated. The paper also presents an analogous method for finding spelling variants that result from OCR errors. The resulting lists of OCR variants and rules may serve to correct OCR output.
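
The abstract does not spell out how the two measures are combined, so the following is only a minimal sketch of one plausible reading: keep a historical/modern word pair as a variant candidate when its Levenshtein distance is small and the cosine similarity of its embeddings is high. The function names, the thresholds (max_dist, min_sim), the example words, and the three-dimensional toy vectors are all assumptions made for illustration; in practice the vectors would come from a trained Word2vec model, and none of the values below are taken from the paper.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def variant_candidates(historical, modern, vectors, max_dist=2, min_sim=0.7):
    """Return (historical, modern, distance, similarity) tuples that pass
    both filters. The thresholds are hypothetical, not the paper's."""
    pairs = []
    for h in historical:
        for m in modern:
            if h == m or h not in vectors or m not in vectors:
                continue
            dist = levenshtein(h, m)
            if dist > max_dist:
                continue
            sim = cosine(vectors[h], vectors[m])
            if sim >= min_sim:
                pairs.append((h, m, dist, sim))
    # Most similar first; ties broken by smaller edit distance.
    return sorted(pairs, key=lambda p: (-p[3], p[2]))


# Toy embeddings standing in for Word2vec vectors (made-up numbers).
vectors = {
    "historya":  [0.90, 0.10, 0.00],
    "historia":  [0.88, 0.12, 0.05],
    "histeria":  [0.10, 0.90, 0.20],
    "tłómaczyć": [0.20, 0.10, 0.90],
    "tłumaczyć": [0.25, 0.10, 0.88],
}
historical_words = ["historya", "tłómaczyć"]
modern_words = ["historia", "histeria", "tłumaczyć"]

pairs = variant_candidates(historical_words, modern_words, vectors)
for old, new, dist, sim in pairs:
    print(f"{old} -> {new} (distance {dist}, similarity {sim:.2f})")
```

In this toy setup "histeria" is within two edits of "historya", so edit distance alone would accept it; the embedding filter is what rejects it. That illustrates why a distributional similarity measure is a useful complement to string distance. The quadratic loop over all word pairs is fine for a sketch but would need indexing or nearest-neighbour lookup on a real corpus.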