
Workshop on NLP for Similar Languages, Varieties and Dialects: Latest Publications

Fine-Tuning BERT with Character-Level Noise for Zero-Shot Transfer to Dialects and Closely-Related Languages
Pub Date : 2023-03-30 DOI: 10.48550/arXiv.2303.17683
Aarohi Srivastava, David Chiang
In this work, we induce character-level noise in various forms when fine-tuning BERT to enable zero-shot cross-lingual transfer to unseen dialects and languages. We fine-tune BERT on three sentence-level classification tasks and evaluate our approach on an assortment of unseen dialects and languages. We find that character-level noise can be an extremely effective agent of cross-lingual transfer under certain conditions, while it is not as helpful in others. Specifically, we explore these differences in terms of the nature of the task and the relationships between source and target languages, finding that the introduction of character-level noise during fine-tuning is particularly helpful when a task draws on surface-level cues and the source-target cross-lingual pair has relatively high lexical overlap with shorter (i.e., less meaningful) unseen tokens on average.
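The abstract does not spell out the exact noise forms used; a minimal sketch of one plausible scheme (per-character random deletion, substitution, or insertion applied to training sentences before fine-tuning) is shown below. The function name `add_char_noise`, the noise rate `p`, and the Latin-only alphabet are illustrative assumptions, not the authors' implementation.

```python
import random

def add_char_noise(text, p=0.1, alphabet="abcdefghijklmnopqrstuvwxyz", seed=None):
    """Corrupt a string with character-level noise: with probability p per
    character, randomly delete it, substitute it, or insert an extra one."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < p:
            op = rng.choice(("delete", "substitute", "insert"))
            if op == "delete":
                continue
            elif op == "substitute":
                out.append(rng.choice(alphabet))
            else:  # insert: keep the character, then add a random one
                out.append(ch)
                out.append(rng.choice(alphabet))
        else:
            out.append(ch)
    return "".join(out)

# During fine-tuning, each training sentence would be noised on the fly:
noisy = add_char_noise("this is a training sentence", p=0.2, seed=0)
```

In such a setup the noise is typically resampled every epoch, so the model never sees the same corrupted string twice.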
Citations: 2
ZHAW-InIT - Social Media Geolocation at VarDial 2020
Pub Date : 2020-12-13 DOI: 10.21256/ZHAW-21551
Fernando Benites, M. Hürlimann, Pius von Däniken, Mark Cieliebak
We describe our approaches for the Social Media Geolocation (SMG) task at the VarDial Evaluation Campaign 2020. The goal was to predict geographical location (latitudes and longitudes) given an input text. There were three subtasks corresponding to German-speaking Switzerland (CH), Germany and Austria (DE-AT), and Croatia, Bosnia and Herzegovina, Montenegro and Serbia (BCMS). We submitted solutions to all subtasks but focused our development efforts on the CH subtask, where we achieved third place out of 16 submissions with a median distance of 15.93 km and had the best result of 14 unconstrained systems. In the DE-AT subtask, we ranked sixth out of ten submissions (fourth of 8 unconstrained systems) and for BCMS we achieved fourth place out of 13 submissions (second of 11 unconstrained systems).
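Submissions to this task are scored by distance on the Earth's surface; the median distance reported above can be computed with the haversine formula. A self-contained sketch follows (the campaign's exact scoring script may differ):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def median_distance_km(pred, gold):
    """Median haversine distance between predicted and gold coordinates,
    each given as a list of (lat, lon) pairs."""
    dists = sorted(haversine_km(plat, plon, glat, glon)
                   for (plat, plon), (glat, glon) in zip(pred, gold))
    n = len(dists)
    mid = n // 2
    return dists[mid] if n % 2 else (dists[mid - 1] + dists[mid]) / 2
```

The median (rather than the mean) keeps a handful of wildly wrong predictions from dominating the score.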
Citations: 2
CLUZH at VarDial GDI 2017: Testing a Variety of Machine Learning Tools for the Classification of Swiss German Dialects
Pub Date : 2017-04-03 DOI: 10.18653/v1/W17-1221
S. Clematide, Peter Makarov
Our submissions for the GDI 2017 Shared Task are the results from three different types of classifiers: Naïve Bayes, Conditional Random Fields (CRF), and Support Vector Machines (SVM). Our CRF-based run achieves a weighted F1 score of 65% (third rank), 0.9% behind the best system. Measured by classification accuracy, our ensemble run (Naïve Bayes, CRF, SVM) reaches 67% (second rank), 1% lower than the best system. We also describe our experiments with Recurrent Neural Network (RNN) architectures. Since they performed worse than our non-neural approaches, we did not include them in the submission.
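The weighted F1 metric used above is per-class F1 averaged with class-frequency (support) weights. A stdlib-only sketch, not the official scorer:

```python
from collections import Counter

def weighted_f1(gold, pred):
    """Support-weighted F1: per-class F1 scores averaged, weighted by
    each class's frequency in the gold labels."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += n * f1
    return total / len(gold)
```

Unlike plain accuracy, this rewards systems that do well on minority dialects as well as majority ones.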
Citations: 16
The Similarity and Mutual Intelligibility between Amharic and Tigrigna Varieties
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1206
Tekabe Legesse Feleke
The present study examined the similarity and mutual intelligibility between Amharic and Tigrigna using three tools: Levenshtein distance, an intelligibility test, and questionnaires. The study showed that both Tigrigna varieties have almost equal phonetic and lexical distances from Amharic. It also indicated that Amharic speakers understand less than 50% of the two varieties. Furthermore, Amharic speakers are more positive about the Ethiopian Tigrigna variety than the Eritrean variety; however, their attitude towards the two varieties does not have an impact on intelligibility. The Amharic speakers' familiarity with the Tigrigna varieties largely depends on the genealogical relation between Amharic and the two Tigrigna varieties.
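Levenshtein distance, the first of the three tools, has a compact dynamic-programming implementation. The length-normalized variant below is common in dialectometry; whether the study normalized in exactly this way is an assumption:

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(a, b):
    """Length-normalized distance in [0, 1]: 0 for identical strings,
    1 for strings with nothing in common."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```

Averaging the normalized distance over a list of cognate word pairs gives a single phonetic/lexical distance between two varieties.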
Citations: 3
Improving the Character Ngram Model for the DSL Task with BM25 Weighting and Less Frequently Used Feature Sets
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1214
Yves Bestgen
This paper describes the system developed by the Centre for English Corpus Linguistics (CECL) to discriminate between similar languages, language varieties and dialects. Based on an SVM with character and POS-tag n-grams as features and the BM25 weighting scheme, it achieved 92.7% accuracy in the Discriminating between Similar Languages (DSL) task, ranking first among eleven systems, with a lead of only 0.2% over the next three teams. A simpler version of the system ranked second in the German Dialect Identification (GDI) task thanks to several ad hoc postprocessing steps. Complementary analyses carried out with a cross-validation procedure suggest that the BM25 weighting scheme can be competitive in this type of task, at least compared with sublinear TF-IDF. POS-tag n-grams also improved system performance.
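BM25 weighting applied to n-gram counts can be sketched as below; the parameter values k1=1.2 and b=0.75 are conventional defaults, not values taken from the paper, and the `+ 1` inside the idf keeps weights non-negative:

```python
import math

def bm25_weights(docs, k1=1.2, b=0.75):
    """Reweight raw term counts with BM25 (in place of sublinear TF-IDF).
    docs: list of {term: raw_count} dicts, one per document."""
    n = len(docs)
    avg_len = sum(sum(d.values()) for d in docs) / n
    df = {}  # document frequency of each term
    for d in docs:
        for t in d:
            df[t] = df.get(t, 0) + 1
    weighted = []
    for d in docs:
        dl = sum(d.values())
        norm = k1 * (1 - b + b * dl / avg_len)  # length normalization
        weighted.append({
            t: (tf * (k1 + 1) / (tf + norm))
               * math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            for t, tf in d.items()
        })
    return weighted
```

Compared with sublinear TF-IDF's `1 + log(tf)`, the BM25 term-frequency component saturates, so a few very frequent character n-grams cannot dominate a document vector.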
Citations: 34
Findings of the VarDial Evaluation Campaign 2017
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1201
Marcos Zampieri, S. Malmasi, Nikola Ljubesic, Preslav Nakov, Ahmed Ali, J. Tiedemann, Yves Scherrer, Noëmi Aepli
We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.
Citations: 167
Investigating Diatopic Variation in a Historical Corpus
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1204
Stefanie Dipper, Sandra Waldenberger
This paper investigates diatopic variation in a historical corpus of German. Based on equivalent word forms from different language areas, replacement rules and mappings are derived which describe the relations between these word forms. These rules and mappings are then interpreted as reflections of morphological, phonological or graphemic variation. Based on sample rules and mappings, we show that our approach can replicate results from historical linguistics. While previous studies were restricted to predefined word lists, or confined to single authors or texts, our approach uses a much wider range of data available in historical corpora.
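One lightweight way to derive such replacement rules from equivalent word-form pairs is to align each pair and collect the substituted substrings. The sketch below uses Python's difflib with hypothetical spelling-variant pairs; it is not the authors' extraction procedure:

```python
from difflib import SequenceMatcher

def replacement_rules(pairs):
    """Derive character-level replacement rules from pairs of equivalent
    word forms, counting how often each rule is observed.
    E.g. ('vnde', 'unde') yields the rule ('v', 'u')."""
    rules = {}
    for src, tgt in pairs:
        sm = SequenceMatcher(None, src, tgt)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "replace":
                rule = (src[i1:i2], tgt[j1:j2])
                rules[rule] = rules.get(rule, 0) + 1
    return rules

# Hypothetical historical spelling variants:
rules = replacement_rules([("vnde", "unde"), ("vmb", "umb")])
```

Frequent rules would then be inspected as candidate reflections of graphemic or phonological variation between language areas.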
Citations: 2
Learning to Identify Arabic and German Dialects using Multiple Kernels
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1225
Radu Tudor Ionescu, Andrei M. Butnaru
We present a machine learning approach for the Arabic Dialect Identification (ADI) and the German Dialect Identification (GDI) Closed Shared Tasks of the DSL 2017 Challenge. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided only for the Arabic data. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Our approach is shallow and simple, but the empirical results obtained in the shared tasks show that it achieves very good results. Indeed, we ranked first in the ADI Shared Task with a weighted F1 score of 76.32% (4.62% above the second place) and fifth in the GDI Shared Task with a weighted F1 score of 63.67% (2.57% below the first place).
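A character p-gram kernel over transcripts, and the simplest way of combining several such kernels (an unweighted sum; actual multiple kernel learning would learn the combination weights), can be sketched as:

```python
from collections import Counter

def pgram_kernel(x, y, p=3):
    """String kernel: dot product of character p-gram count vectors,
    i.e. the number of shared p-grams counted with multiplicity."""
    cx = Counter(x[i:i + p] for i in range(len(x) - p + 1))
    cy = Counter(y[i:i + p] for i in range(len(y) - p + 1))
    return sum(cx[g] * cy[g] for g in cx)

def combined_kernel(x, y, ps=(2, 3, 4)):
    """Combine several base kernels; here an unweighted sum over
    different p-gram lengths."""
    return sum(pgram_kernel(x, y, p) for p in ps)
```

A kernel machine such as KDA or KRR then works directly with the resulting pairwise similarity matrix, never materializing the huge p-gram feature space.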
Citations: 38
Kurdish Interdialect Machine Translation
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1208
Hossein Hassani
This research suggests a method for machine translation between two Kurdish dialects. We chose the two widely spoken dialects, Kurmanji and Sorani, which are considered to be mutually unintelligible. Also, despite being spoken by about 30 million people in different countries, Kurdish is among less-resourced languages. The research used bi-dialectal dictionaries and showed that the lack of parallel corpora is not a major obstacle in machine translation between the two dialects. The experiments showed that the machine-translated texts are comprehensible to those who do not speak the source dialect. The research is the first attempt at inter-dialect machine translation in Kurdish and could in particular help make online texts in one dialect comprehensible to those who only speak the target dialect. The results showed that the translated texts were rated as understandable in 71% of cases for Kurmanji and 79% for Sorani, and as slightly understandable in 29% of cases for Kurmanji and 21% for Sorani.
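The dictionary-based core of such a system reduces to word-for-word lookup with pass-through for out-of-vocabulary words; a toy sketch with a hypothetical one-entry Kurmanji-to-Sorani dictionary (the real system's dictionaries and any morphological handling are not shown):

```python
def translate(sentence, dictionary):
    """Word-for-word translation with a bi-dialectal dictionary;
    out-of-vocabulary words are passed through unchanged."""
    return " ".join(dictionary.get(w, w) for w in sentence.split())

# Hypothetical entry: Kurmanji "ez" mapped to a Sorani form "min".
kmr_to_ckb = {"ez": "min"}
```

Pass-through for unknown words is what lets the approach work without parallel corpora: closely related dialects share enough vocabulary that untranslated words often remain intelligible.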
Citations: 16
Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1217
Helena Gómez-Adorno, I. Markov, J. Baptista, G. Sidorov, David Pinto
This paper presents the cic_ualg system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
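Typed character n-grams annotate each n-gram with its position in the word, so that the same character string occurring in different positions yields distinct features. The exact typing scheme used by the system is not given in the abstract; the sketch below uses one common prefix/mid/suffix/whole-word categorization:

```python
def typed_char_ngrams(word, n=3):
    """Character n-grams of a word annotated with their position:
    'prefix', 'suffix', 'whole' or 'mid', so that e.g. the trigram
    'the' as a whole word and 'the' inside 'other' stay distinct."""
    if len(word) < n:
        return []
    if len(word) == n:
        return [("whole", word)]
    grams = []
    for i in range(len(word) - n + 1):
        g = word[i:i + n]
        if i == 0:
            grams.append(("prefix", g))
        elif i == len(word) - n:
            grams.append(("suffix", g))
        else:
            grams.append(("mid", g))
    return grams
```

For closely related languages, position-typed n-grams can separate, say, a shared stem from a language-specific inflectional ending that an untyped n-gram model would conflate.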
Citations: 16