Discriminating between Similar Languages using Weighted Subword Features

Workshop on NLP for Similar Languages, Varieties and Dialects. Pub Date: 1900-01-01. DOI: 10.18653/v1/W17-1223

A. Barbaresi

Citations: 10

Abstract

This contribution presents a contrastive subword n-gram model that was tested in the Discriminating between Similar Languages (DSL) shared task. I present and discuss the method used in this 14-way language identification task, which comprises varieties of 6 main language groups. The method has three components: (1) preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used as a system in a straightforward way: my approach outperforms most of the systems used in the DSL shared task (3rd rank).
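The three components listed in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the author's implementation: the abstract does not say how the n-gram profiles are weighted, so the smoothed IDF weighting, the 2-to-4 character n-gram range, the add-one smoothing, and the names `char_ngrams` and `WeightedNgramNB` are all assumptions made for illustration. The training sentences are toy data, not taken from the DSL corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=2, n_max=4):
    """Step 1: convert a document to sparse character n-gram counts."""
    padded = " " + text.lower().strip() + " "  # pad so word edges form n-grams
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


class WeightedNgramNB:
    """Multinomial naive Bayes over IDF-weighted n-gram counts (assumed weighting)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing constant

    def fit(self, texts, labels):
        docs = [char_ngrams(t) for t in texts]
        n_docs = len(docs)
        # Step 2: weight each n-gram by a smoothed inverse document frequency.
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(doc.keys())
        self.idf = {g: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for g, df in doc_freq.items()}
        # Step 3: accumulate weighted n-gram mass per class.
        self.mass = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            for gram, count in doc.items():
                self.mass[label][gram] += count * self.idf[gram]
        self.total = {c: sum(m.values()) for c, m in self.mass.items()}
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(k / n_docs) for c, k in class_counts.items()}
        return self

    def predict(self, text):
        vocab_size = len(self.idf)
        features = char_ngrams(text)
        scores = {}
        for label in self.log_prior:
            denom = self.total[label] + self.alpha * vocab_size
            score = self.log_prior[label]
            for gram, count in features.items():
                if gram in self.idf:  # n-grams unseen in training are ignored
                    weight = count * self.idf[gram]
                    score += weight * math.log(
                        (self.mass[label][gram] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy usage with hypothetical labels and sentences.
model = WeightedNgramNB().fit(
    ["lijepo vrijeme danas", "gdje si bio", "lepo vreme danas", "gde si bio"],
    ["hr", "hr", "sr", "sr"],
)
print(model.predict("lijepo mjesto"))
```

Writing the classifier by hand keeps the sketch free of dependencies. In practice the same pipeline is usually assembled from an off-the-shelf TF-IDF vectorizer and a multinomial naive Bayes classifier.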
