使用类型化和非类型化字符n -图和词的组合来区分相似语言

Helena Gómez-Adorno, I. Markov, J. Baptista, G. Sidorov, David Pinto
{"title":"使用类型化和非类型化字符n -图和词的组合来区分相似语言","authors":"Helena Gómez-Adorno, I. Markov, J. Baptista, G. Sidorov, David Pinto","doi":"10.18653/v1/W17-1217","DOIUrl":null,"url":null,"abstract":"This paper presents the cic_ualg’s system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms – Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.","PeriodicalId":167439,"journal":{"name":"Workshop on NLP for Similar Languages, Varieties and Dialects","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":"{\"title\":\"Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words\",\"authors\":\"Helena Gómez-Adorno, I. Markov, J. Baptista, G. Sidorov, David Pinto\",\"doi\":\"10.18653/v1/W17-1217\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents the cic_ualg’s system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms – Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.\",\"PeriodicalId\":167439,\"journal\":{\"name\":\"Workshop on NLP for Similar Languages, Varieties and Dialects\",\"volume\":\"46 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"16\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Workshop on NLP for Similar Languages, Varieties and Dialects\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/W17-1217\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on NLP for Similar Languages, Varieties and Dialects","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/W17-1217","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

摘要

本文介绍了在VarDial 2017研讨会上参加区分相似语言(DSL)共享任务的cic_ualg系统。今年的任务旨在利用新闻文本节选语料库识别6个语言群体中的14种语言。比较了两种分类方法:单步(所有语言)方法和两步(语言组和组内语言)方法。被利用的特征包括词汇特征(单字)和字符n-图。除了传统的(非类型化的)字符n-图之外,我们在DSL任务中引入了类型化的字符n-图。实验使用不同的特征表示方法(二进制和原始项频率)、频率阈值和机器学习算法-支持向量机(SVM)和多项朴素贝叶斯(MNB)进行。我们在DSL任务中的最佳运行达到了91.46%的准确率。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words
This paper presents the cic_ualg’s system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms – Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Fine-Tuning BERT with Character-Level Noise for Zero-Shot Transfer to Dialects and Closely-Related Languages ZHAW-InIT - Social Media Geolocation at VarDial 2020 CLUZH at VarDial GDI 2017: Testing a Variety of Machine Learning Tools for the Classification of Swiss German Dialects Learning to Identify Arabic and German Dialects using Multiple Kernels Investigating Diatopic Variation in a Historical Corpus
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1