
Workshop on NLP for Similar Languages, Varieties and Dialects: Latest Publications

A Perplexity-Based Method for Similar Languages Discrimination
Pub Date: 2017 DOI: 10.18653/v1/W17-1213
Pablo Gamallo, José Ramom Pichel Campos, I. Alegria
This article describes the system submitted by the Citius_Ixa_Imaxin team to VarDial 2017 (DSL and GDI tasks). The strategy underlying our system is based on a language distance computed by means of model perplexity. The best model configuration we tested is a voting system that makes use of several n-gram models over both words and characters, although word unigrams turned out to be a very competitive model, with reasonable results in the tasks in which we participated. We also performed an error analysis, in which we identified many test examples with no linguistic evidence to distinguish among the variants.
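The core mechanism, scoring a test text against one n-gram model per candidate language and choosing the language whose model yields the lowest perplexity, can be sketched as follows. This is a minimal illustration using add-one-smoothed character bigrams scored independently; the training snippets and the vocabulary-size constant are invented placeholders, not the authors' voting configuration, which combines several word- and character-level models.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_model(text, n=2):
    # Character n-gram counts serve as a simple language model.
    counts = Counter(char_ngrams(text, n))
    return counts, sum(counts.values())

def perplexity(text, model, n=2, vocab_size=10_000):
    # Add-one-smoothed perplexity of `text` under `model`.
    counts, total = model
    grams = char_ngrams(text, n)
    log_prob = sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)
    return math.exp(-log_prob / max(len(grams), 1))

# Hypothetical training snippets, one per language variant.
train_texts = {"pt-PT": "não sei o que dizer sobre isto",
               "pt-BR": "não sei o que falar sobre isso"}
models = {lang: train_model(text) for lang, text in train_texts.items()}

# Label the test text with the lowest-perplexity (closest) language.
print(min(models, key=lambda lang: perplexity("o que dizer", models[lang])))
```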
Citations: 23
Evaluating HeLI with Non-Linear Mappings
Pub Date: 2017 DOI: 10.18653/v1/W17-1212
T. Jauhiainen, Krister Lindén, H. Jauhiainen
In this paper we describe the non-linear mappings we used with the Helsinki language identification method, HeLI, in the 4th edition of the Discriminating between Similar Languages (DSL) shared task, organized as part of the VarDial 2017 workshop. Our SUKI team participated in the closed track together with 10 other teams, and our system reached 7th position. We describe the HeLI method and the non-linear mappings in mathematical notation. The HeLI method uses a probabilistic model with character n-grams and word-based backoff. We also describe our trials using the non-linear mappings instead of relative frequencies, and we present statistics about the backoff function of the HeLI method.
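The word-plus-backoff structure described above can be illustrated with a small sketch: each word is scored by its negative log relative frequency in a language's word model, and unseen words back off to the mean score of their character n-grams, with a fixed penalty for n-grams the model has never seen. The corpora, the single n-gram order, and the penalty value are placeholder assumptions; the actual HeLI method combines several n-gram lengths and tuned parameters.

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(corpus, n=3):
    words = corpus.split()
    word_counts = Counter(words)
    gram_counts = Counter(g for w in words for g in char_ngrams(w, n))
    return word_counts, sum(word_counts.values()), gram_counts, sum(gram_counts.values())

def score_word(word, model, n=3, penalty=7.0):
    # Negative log relative frequency of the word; unseen words back off
    # to the mean score of their character n-grams.
    word_counts, w_total, gram_counts, g_total = model
    if word in word_counts:
        return -math.log10(word_counts[word] / w_total)
    scores = [-math.log10(gram_counts[g] / g_total) if g in gram_counts else penalty
              for g in char_ngrams(word, n)]
    return sum(scores) / len(scores)

def score_text(text, model):
    words = text.split()
    return sum(score_word(w, model) for w in words) / len(words)

# Placeholder corpora; the language with the lowest mean penalty wins.
models = {"fi": train("tämä on suomenkielinen esimerkkilause"),
          "et": train("see on eestikeelne näidislause")}
print(min(models, key=lambda lang: score_text("tämä on lause", models[lang])))
```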
Citations: 12
Computational analysis of Gondi dialects
Pub Date: 2017 DOI: 10.18653/v1/W17-1203
Taraka Rama, Çagri Çöltekin, Pavel Sofroniev
This paper presents a computational analysis of Gondi dialects spoken in central India. We present a digitized data set of the dialect area, and analyze the data using different techniques from dialectometry, deep learning, and computational biology. We show that the methods largely agree with each other and with the earlier non-computational analyses of the language group.
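One standard building block on the dialectometry side of such an analysis, though not necessarily the exact pipeline used in the paper, is the aggregate length-normalized Levenshtein distance between word lists from different dialect sites. The sketch below illustrates it on invented forms aligned by concept.

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def site_distance(forms_a, forms_b):
    # Mean length-normalized edit distance over concept-aligned forms.
    dists = [levenshtein(a, b) / max(len(a), len(b))
             for a, b in zip(forms_a, forms_b) if a and b]
    return sum(dists) / len(dists)

# Invented word lists for two dialect sites, aligned by concept.
site1 = ["pani", "ghar", "mati"]
site2 = ["paani", "ghar", "maati"]
print(round(site_distance(site1, site2), 3))
```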
Citations: 3
Twitter Language Identification Of Similar Languages And Dialects Without Ground Truth
Pub Date: 2017 DOI: 10.18653/v1/W17-1209
Jennifer Williams, Charlie K. Dagli
We present a new method to bootstrap-filter Twitter language ID labels in our dataset for automatic language identification (LID). Our method combines geo-location, original Twitter LID labels, and Amazon Mechanical Turk to resolve missing and unreliable labels. We are the first to compare LID classification performance using the MIRA algorithm and langid.py. We show classifier performance on different versions of our dataset with high accuracy, using only Twitter data, no ground truth, and very few training examples. We also show how Platt scaling can be used to calibrate MIRA classifier output values into a probability distribution over candidate classes, making the output more intuitive. Our method allows for fine-grained distinctions between similar languages and dialects and allows us to rediscover the language composition of our Twitter dataset.
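Platt scaling fits a sigmoid to raw classifier scores so they can be read as probabilities. A minimal sketch, assuming held-out margin scores with binary correctness labels (both invented here); for a multiclass system such as MIRA, one sigmoid can be fitted per class one-vs-rest and the calibrated scores renormalized to sum to one over the candidate classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical raw margin scores from a trained classifier on held-out
# data, with binary labels (1 = the candidate class was correct).
scores = np.array([-2.1, -0.4, 0.3, 1.2, 2.5, -1.7, 0.9, 1.8]).reshape(-1, 1)
labels = np.array([0, 0, 1, 1, 1, 0, 1, 1])

# Platt scaling: fit P(y=1|s) = 1 / (1 + exp(-(A*s + B))) by logistic
# regression on the one-dimensional score feature.
calibrator = LogisticRegression()
calibrator.fit(scores, labels)

new_scores = np.array([[0.5], [-1.0]])
print(calibrator.predict_proba(new_scores)[:, 1])  # calibrated probabilities
```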
Citations: 20
Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing
Pub Date: 2017 DOI: 10.18653/v1/W17-1218
Çagri Çöltekin, Taraka Rama
This paper describes our systems and results on the VarDial 2017 shared tasks. Besides three language/dialect discrimination tasks, we also participated in the cross-lingual dependency parsing (CLP) task, using a simple methodology that we also briefly describe in this paper. For all the discrimination tasks, we used linear SVMs with character and word features. The system achieved competitive results among the other systems in the shared task. We also report additional experiments with neural network models, whose performance in the discrimination tasks was close to, but always below, that of the corresponding SVM classifiers. For the cross-lingual parsing task, we experimented with an approach based on automatically translating the source treebank into the target language and training a parser on the translated treebank, using off-the-shelf tools for both translation and parsing. Despite achieving better-than-baseline results, our scores in the CLP task were substantially lower than those of the other participants.
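The discrimination setup, a linear SVM over combined word and character n-gram features, can be approximated with off-the-shelf components along the following lines; the toy data, n-gram ranges, and TF-IDF weighting are illustrative assumptions rather than the authors' exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Toy training data standing in for the shared-task corpora.
texts = ["isto é um exemplo", "isso é um exemplo",
         "das ist ein beispiel", "das isch es bispil"]
labels = ["pt-PT", "pt-BR", "de-DE", "gsw"]

# Join word n-grams with character n-grams into one sparse feature
# space and train a linear SVM on it.
features = FeatureUnion([
    ("words", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("chars", TfidfVectorizer(analyzer="char", ngram_range=(1, 5))),
])
clf = make_pipeline(features, LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["isto é um teste"]))
```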
Citations: 25
Discriminating between Similar Languages using Weighted Subword Features
Pub Date: 2017 DOI: 10.18653/v1/W17-1223
A. Barbaresi
The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages (DSL) shared task. I present and discuss the method used in this 14-way language identification task comprising varieties of 6 main language groups. It features the following characteristics: (1) preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used in a system in a straightforward way; my approach outperformed most of the systems in the DSL shared task, ranking 3rd.
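The three listed components map onto a short pipeline sketch; the toy corpus, n-gram range, and sublinear TF weighting below are assumptions for illustration, not the submitted system's exact settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data standing in for the DSL training corpus.
texts = ["ovo je primjer rečenice", "ovo je primer rečenice",
         "esto es un ejemplo", "isto é um exemplo"]
labels = ["hr", "sr", "es", "pt"]

# Weighted character n-gram profiles ('char_wb' keeps n-grams inside
# word boundaries, approximating subword units) feeding a multinomial
# Bayesian classifier.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), sublinear_tf=True),
    MultinomialNB(),
)
model.fit(texts, labels)
print(model.predict(["ovo je test"]))
```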
Citations: 10