Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words

Workshop on NLP for Similar Languages, Varieties and Dialects. Pub Date: 2017-04-01. DOI: 10.18653/v1/W17-1217

Helena Gómez-Adorno, I. Markov, J. Baptista, G. Sidorov, David Pinto

Citations: 16

Abstract

This paper presents the cic_ualg system, which took part in the Discriminating between Similar Languages (DSL) shared task held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step approach (all languages at once) and a two-step approach (first the language group, then the language within the group). The features exploited include lexical features (word unigrams) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
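
The feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the category names in `ngram_type`, the feature prefixes (`w:`, `c:`, `t:`), the n-gram length of 3, and the reading of "frequency threshold" as a minimum corpus-wide count are all assumptions made for illustration. The sketch extracts word unigrams, untyped character n-grams, and a simplified form of typed character n-grams, then builds either binary or raw term-frequency vectors.

```python
from collections import Counter
import string

PUNCT = set(string.punctuation)


def untyped_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_type(text, i, n):
    """Assign a coarse category to the n-gram starting at index i.

    Simplified, assumed typing scheme; the paper's exact categories may differ.
    """
    gram = text[i:i + n]
    if any(ch in PUNCT for ch in gram):
        return "punct"
    if gram[0] == " ":
        return "space-prefix"
    if gram[-1] == " ":
        return "space-suffix"
    if " " in gram:
        return "multi-word"
    # The n-gram lies inside one word: inspect the neighbouring characters.
    left_edge = i == 0 or not text[i - 1].isalnum()
    right_edge = i + n == len(text) or not text[i + n].isalnum()
    if left_edge and right_edge:
        return "whole-word"
    if left_edge:
        return "prefix"
    if right_edge:
        return "suffix"
    return "mid-word"


def typed_ngrams(text, n=3):
    """Return character n-grams labelled with their (assumed) category."""
    return [f"{ngram_type(text, i, n)}:{text[i:i + n]}"
            for i in range(len(text) - n + 1)]


def extract_features(text, n=3):
    """Count word unigrams, untyped n-grams and typed n-grams in one text."""
    feats = Counter()
    feats.update("w:" + w for w in text.split())
    feats.update("c:" + g for g in untyped_ngrams(text, n))
    feats.update("t:" + g for g in typed_ngrams(text, n))
    return feats


def build_vocabulary(doc_feats, min_freq=1):
    """Keep features whose total count over all documents is >= min_freq."""
    total = Counter()
    for feats in doc_feats:
        total.update(feats)
    return sorted(f for f, c in total.items() if c >= min_freq)


def vectorize(feats, vocabulary, binary=False):
    """Map a feature Counter to a binary or raw term-frequency vector."""
    if binary:
        return [1 if feats[f] > 0 else 0 for f in vocabulary]
    return [feats[f] for f in vocabulary]


if __name__ == "__main__":
    docs = ["o gato preto", "el gato negro"]  # toy examples, not task data
    doc_feats = [extract_features(d) for d in docs]
    vocab = build_vocabulary(doc_feats, min_freq=2)
    for feats in doc_feats:
        print(vectorize(feats, vocab, binary=True))
```

The resulting vectors could then be passed to any SVM or Multinomial Naive Bayes implementation. For the two-step approach, one classifier would first predict the language group and a group-specific classifier would then predict the language within that group.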
