
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019): Latest Publications

Cross-lingual Annotation Projection Is Effective for Neural Part-of-Speech Tagging
Pub Date : 2019-06-01 DOI: 10.18653/v1/W19-1425
Matthias Huck, Diana Dutka, Alexander M. Fraser
We tackle the important task of part-of-speech tagging using a neural model in the zero-resource scenario, where we have no access to gold-standard POS training data. We compare this scenario with the low-resource scenario, where we have access to a small amount of gold-standard POS training data. Our experiments focus on Ukrainian as a representative of under-resourced languages. Russian is highly related to Ukrainian, so we exploit gold-standard Russian POS tags. We consider four techniques to perform Ukrainian POS tagging: zero-shot tagging and cross-lingual annotation projection (for the zero-resource scenario), and compare these with self-training and multilingual learning (for the low-resource scenario). We find that cross-lingual annotation projection works particularly well in the zero-resource scenario.
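As a minimal illustration of the cross-lingual annotation projection idea (a sketch, not the paper's actual pipeline), the snippet below copies gold POS tags from source-language tokens (e.g. Russian) onto target-language tokens (e.g. Ukrainian) along word-alignment links; the function name and the fallback tag are illustrative assumptions:

```python
def project_pos_tags(src_tags, alignment, tgt_len, default="X"):
    """Copy gold POS tags from source tokens onto aligned target tokens.

    src_tags:  gold POS tag per source token
    alignment: (src_idx, tgt_idx) word-alignment links
    tgt_len:   number of target tokens
    default:   fallback tag for target tokens left unaligned
    """
    projected = [default] * tgt_len
    for src_idx, tgt_idx in alignment:
        projected[tgt_idx] = src_tags[src_idx]
    return projected

# Toy example: a three-token source sentence aligned one-to-one
# with a three-token target sentence.
tags = project_pos_tags(["NOUN", "VERB", "NOUN"], [(0, 0), (1, 1), (2, 2)], 3)
```

In practice the alignments come from a statistical word aligner over parallel text, and the projected tags then serve as (noisy) training data for the neural tagger.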
Citations: 16
SC-UPB at the VarDial 2019 Evaluation Campaign: Moldavian vs. Romanian Cross-Dialect Topic Identification
Pub Date : 2019-06-01 DOI: 10.18653/v1/W19-1418
Cristian Onose, Dumitru-Clementin Cercel, Stefan Trausan-Matu
This paper describes our models for the Moldavian vs. Romanian Cross-Topic Identification (MRC) evaluation campaign, part of the VarDial 2019 workshop. We focus on the three subtasks for MRC: binary classification between the Moldavian (MD) and the Romanian (RO) dialects and two cross-dialect multi-class classification between six news topics, MD to RO and RO to MD. We propose several deep learning models based on long short-term memory cells, Bidirectional Gated Recurrent Unit (BiGRU) and Hierarchical Attention Networks (HAN). We also employ three word embedding models to represent the text as a low dimensional vector. Our official submission includes two runs of the BiGRU and HAN models for each of the three subtasks. The best submitted model obtained the following macro-averaged F1 scores: 0.708 for subtask 1, 0.481 for subtask 2 and 0.480 for the last one. Due to a read error caused by the quoting behaviour over the test file, our final submissions contained a smaller number of items than expected. More than 50% of the submission files were corrupted. Thus, we also present the results obtained with the corrected labels for which the HAN model achieves the following results: 0.930 for subtask 1, 0.590 for subtask 2 and 0.687 for the third one.
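The macro-averaged F1 scores quoted above weight every class equally, regardless of class frequency. A small self-contained sketch of the metric (illustrative, not the shared-task organizers' official scorer):

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight,
    so rare classes count as much as frequent ones."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For the binary MD/RO subtask this reduces to averaging the F1 of the two dialect classes.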
Citations: 12
Ensemble Methods to Distinguish Mainland and Taiwan Chinese
Pub Date : 2019-06-01 DOI: 10.18653/v1/W19-1417
Hai Hu, Wen Li, He Zhou, Zuoyu Tian, Yiwen Zhang, Liang Zou
This paper describes the IUCL system at the VarDial 2019 evaluation campaign for the task of discriminating between the Mainland and Taiwan variants of Mandarin Chinese. We first build several base classifiers, including a Naive Bayes classifier with word n-grams as features, SVMs with both character and syntactic features, and neural networks with pre-trained character/word embeddings. We then adopt ensemble methods that combine the outputs of the base classifiers to make final predictions. Our ensemble models achieve the highest F1 score (0.893) on the simplified Chinese track and the second highest (0.901) on the traditional Chinese track. Our results demonstrate the effectiveness and robustness of the ensemble methods.
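One simple way to combine base classifiers, sketched below, is hard majority voting; the abstract does not specify which ensembling scheme the authors used, so this is only an assumed, minimal instance of the general idea:

```python
from collections import Counter

def majority_vote(predictions):
    """Hard-voting ensemble: for each test item, output the label predicted
    by the most base classifiers (ties broken by first occurrence).

    predictions: one list of predicted labels per base classifier,
    all over the same sequence of test items."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]
```

Soft voting (averaging per-class probabilities) or a meta-classifier over the base outputs are common alternatives.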
Citations: 9
Toward a deep dialectological representation of Indo-Aryan
Pub Date : 2019-06-01 DOI: 10.18653/v1/W19-1411
C. Cathcart
This paper presents a new approach to disentangling inter-dialectal and intra-dialectal relationships within a single language group, the Indo-Aryan subgroup of Indo-European. We draw upon admixture models and deep generative models to tease apart historic language contact and language-specific behavior in the overall patterns of sound change displayed by Indo-Aryan languages. We show that a “deep” model of Indo-Aryan dialectology sheds some light on questions regarding inter-relationships among the Indo-Aryan languages, and performs better than a “shallow” model in terms of certain qualities of the posterior distribution (e.g., entropy of posterior distributions), and outline future pathways for model development.
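The entropy of a posterior distribution, mentioned above as a model-quality criterion, measures how concentrated the model's beliefs are. A minimal sketch of the quantity (the Shannon entropy of a discrete distribution, not the paper's full evaluation):

```python
import math

def posterior_entropy(probs):
    """Shannon entropy (in nats) of a discrete posterior distribution;
    a lower value means the distribution concentrates its probability
    mass more sharply on a few outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A uniform posterior over k outcomes has the maximal entropy log(k), while a posterior that puts all mass on one outcome has entropy 0.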
Citations: 4
A Report on the Third VarDial Evaluation Campaign
Pub Date : 2019-06-01 DOI: 10.18653/v1/W19-1401
Marcos Zampieri, S. Malmasi, Yves Scherrer, T. Samardžić, Francis M. Tyers, Miikka Silfverberg, N. Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei M. Butnaru, T. Jauhiainen
In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.
Citations: 57
Joint Approach to Deromanization of Code-mixed Texts
Pub Date : 2019-06-01 DOI: 10.18653/v1/W19-1403
Rashed Rubby Riyadh, Grzegorz Kondrak
The conversion of romanized texts back to the native scripts is a challenging task because of the inconsistent romanization conventions and non-standard language use. This problem is compounded by code-mixing, i.e., using words from more than one language within the same discourse. In this paper, we propose a novel approach for handling these two problems together in a single system. Our approach combines three components: language identification, back-transliteration, and sequence prediction. The results of our experiments on Bengali and Hindi datasets establish the state of the art for the task of deromanization of code-mixed texts.
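To make the three-component pipeline concrete, here is a toy, dictionary-based sketch for romanized Hindi–English code-mixed text; the lexicons and the transliteration table are invented stand-ins for the learned language-identification and back-transliteration models described in the paper:

```python
# Purely illustrative stand-ins for learned models.
ENGLISH_WORDS = {"the", "movie", "was"}
HINDI_MAP = {"bahut": "बहुत", "accha": "अच्छा"}

def deromanize(tokens):
    """Per token: identify the language, then back-transliterate only the
    romanized Hindi tokens, leaving English tokens in Latin script."""
    out = []
    for tok in tokens:
        if tok.lower() in ENGLISH_WORDS:          # language identification
            out.append(tok)
        else:                                      # back-transliteration
            out.append(HINDI_MAP.get(tok.lower(), tok))
    return out
```

The paper's joint system replaces the lexicon lookup with a trained language identifier and the table lookup with a transliteration model plus sequence prediction, so the three decisions inform one another instead of being made token by token.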
Citations: 6
TwistBytes - Identification of Cuneiform Languages and German Dialects at VarDial 2019
Pub Date : 2019-06-01 DOI: 10.18653/v1/W19-1421
Fernando Benites, P. von Däniken, Mark Cieliebak
We describe our approaches for the German Dialect Identification (GDI) and the Cuneiform Language Identification (CLI) tasks at the VarDial Evaluation Campaign 2019. The goal was to identify dialects of Swiss German in GDI and Sumerian and Akkadian in CLI. In GDI, the system should distinguish four dialects from the German-speaking part of Switzerland. Our system for GDI achieved third place out of 6 teams, with a macro averaged F-1 of 74.6%. In CLI, the system should distinguish seven languages written in cuneiform script. Our system achieved third place out of 8 teams, with a macro averaged F-1 of 74.7%.
Citations: 7
Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models
Pub Date : 2019-04-30 DOI: 10.18653/v1/W19-1419
T. Jauhiainen, Krister Lindén, H. Jauhiainen
This paper describes the language identification systems used by the SUKI team in the Discriminating between the Mainland and Taiwan variation of Mandarin Chinese (DMT) and the German Dialect Identification (GDI) shared tasks which were held as part of the third VarDial Evaluation Campaign. The DMT shared task included two separate tracks, one for the simplified Chinese script and one for the traditional Chinese script. We submitted three runs on both tracks of the DMT task as well as on the GDI task. We won the traditional Chinese track using Naive Bayes with language model adaptation, came second on GDI with an adaptive version of the HeLI 2.0 method, and third on the simplified Chinese track using again the adaptive Naive Bayes.
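A character n-gram Naive Bayes classifier, the non-adaptive core of the approach above, can be sketched in a few lines; this is a generic add-one-smoothed version, not the SUKI team's exact model:

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramNB:
    """Character n-gram Naive Bayes with add-one smoothing."""
    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # label -> Counter of character n-grams
        self.vocab = set()

    def train(self, docs, labels):
        for doc, label in zip(docs, labels):
            grams = char_ngrams(doc, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        v = len(self.vocab)
        for label, c in self.counts.items():
            total = sum(c.values())
            lp = sum(math.log((c[g] + 1) / (total + v))
                     for g in char_ngrams(doc, self.n))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Toy usage with two artificial "languages".
nb = NgramNB(2)
nb.train(["aaaa aaab baaa", "cccd cccc dccc"], ["A", "C"])
```

The "language model adaptation" the authors apply would, roughly, feed the most confidently labelled test documents back into these per-label counts and re-score the rest; that iterative step is omitted here.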
Citations: 19
Language Discrimination and Transfer Learning for Similar Languages: Experiments with Feature Combinations and Adaptation
Pub Date : 1900-01-01 DOI: 10.18653/v1/W19-1406
Nianheng Wu, Eric DeMattos, Kwok Him So, Pin-zhen Chen, Çagri Çöltekin
This paper describes the work done by team tearsofjoy participating in the VarDial 2019 Evaluation Campaign. We developed two systems based on Support Vector Machines: SVM with a flat combination of features and SVM ensembles. We participated in all language/dialect identification tasks, as well as the Moldavian vs. Romanian cross-dialect topic identification (MRC) task. Our team achieved first place in German Dialect identification (GDI) and MRC subtasks 2 and 3, second place in the simplified variant of Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT) as well as Cuneiform Language Identification (CLI), and third and fifth place in DMT traditional and MRC subtask 1 respectively. In most cases, the SVM with a flat combination of features performed better than SVM ensembles. Besides describing the systems and the results obtained by them, we provide a tentative comparison between the feature combination methods, and present additional experiments with a method of adaptation to the test set, which may indicate potential pitfalls with some of the data sets.
Citations: 26
BAM: A combination of deep and shallow models for German Dialect Identification.
Pub Date : 1900-01-01 DOI: 10.18653/v1/W19-1413
Andrei M. Butnaru
*This is a submission for the Third VarDial Evaluation Campaign* In this paper, we present a machine learning approach for the German Dialect Identification (GDI) Closed Shared Task of the DSL 2019 Challenge. The proposed approach combines deep and shallow models by applying a voting scheme to the outputs of a Character-level Convolutional Neural Network (Char-CNN), a Long Short-Term Memory (LSTM) network, and a model based on String Kernels. The first model is the Char-CNN, which merges multiple convolutions computed with kernels of different sizes. The second model is the LSTM network, which applies global max pooling over the returned sequences over time. Both models pass their activation maps to two fully-connected layers. The final model is based on String Kernels computed on character p-grams extracted from speech transcripts. The model combines two blended kernel functions: the presence bits kernel and the intersection kernel. The empirical results obtained in the shared task show that the approach achieves good results. The system proposed in this paper obtained fourth place with a macro-F1 score of 62.55%.
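The two string kernels named above have simple closed forms over character p-grams. A minimal sketch of both (the abstract names the kernels but not this exact code, which is an illustrative reconstruction):

```python
from collections import Counter

def pgrams(text, p):
    """All contiguous character substrings of length p."""
    return [text[i:i + p] for i in range(len(text) - p + 1)]

def presence_bits_kernel(a, b, p=3):
    """Presence bits kernel: the number of distinct p-grams that occur in
    both strings, ignoring how often each p-gram appears."""
    return len(set(pgrams(a, p)) & set(pgrams(b, p)))

def intersection_kernel(a, b, p=3):
    """Intersection kernel: over the shared p-grams, sum the minimum of
    their frequencies in the two strings."""
    ca, cb = Counter(pgrams(a, p)), Counter(pgrams(b, p))
    return sum(min(ca[g], cb[g]) for g in ca.keys() & cb.keys())
```

Blending the two kernels (e.g. summing the kernel matrices) combines the binary-occurrence and frequency views of the same p-gram inventory.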
Citations: 3