Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora

Applied Computer Systems (IF 0.8, Q4, Computer Science, Theory & Methods) | Pub Date: 2021-12-01 | DOI: 10.2478/acss-2021-0016

Rolands Laucis, Gints Jēkabsons

Citations: 0

Abstract

Natural language processing (NLP) increasingly relies on pre-trained word embeddings for a variety of tasks. However, little research has been devoted to Latvian, a language that is much more morphologically complex than English. In this study, four methods of creating word embeddings (word2vec, fastText, Structured Skip-Gram and ngram2vec) were evaluated experimentally on three NLP tasks. The results can serve as a baseline for future NLP research on the Latvian language. The main conclusions are the following. First, in the part-of-speech tagging task, a training corpus 46 times smaller than the one used in a previous study yielded an accuracy of 91.4% (versus 98.3% in the previous study). Second, fastText demonstrated the best overall effectiveness. Third, all methods performed best with embeddings of dimension 200. Finally, word lemmatization generally did not improve results.
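The abstract does not say why fastText performed best. One well-known property of fastText that is often linked to morphologically rich languages is that it represents a word through its character n-grams, so inflected forms of the same word share many subword units. The sketch below only illustrates that idea and is not the authors' code: the names `char_ngrams` and `jaccard`, the n-gram range of 3 to 6, and the example word forms "grāmata" and "grāmatai" are assumptions chosen for illustration, and the hashing of n-grams into buckets that fastText performs is omitted.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word, fastText-style.

    The word is wrapped in '<' and '>' boundary markers so that prefixes
    and suffixes are distinguishable from word-internal n-grams.
    The range 3..6 is an assumed setting, not taken from the paper.
    """
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def jaccard(a, b):
    """Jaccard overlap between two sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


# Two inflected forms of the Latvian word for "book" (illustrative choice).
nominative = char_ngrams("grāmata")
dative = char_ngrams("grāmatai")

# N-grams common to both forms; these let the two forms share parameters.
shared = nominative & dative

print(sorted(shared))
print(f"overlap: {jaccard(nominative, dative):.2f}")
```

A purely word-level model such as word2vec would treat the two forms as unrelated vocabulary entries, whereas here the shared n-grams give them overlapping representations even when one form is rare in the training corpus.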


Source journal: Applied Computer Systems (Computer Science, Theory & Methods)

Self-citation rate: 10.00%

Review time: 30 weeks