SC-UPB at the VarDial 2019 Evaluation Campaign: Moldavian vs. Romanian Cross-Dialect Topic Identification

Proceedings of the Sixth Workshop on | Pub Date: 2019-06-01 | DOI: 10.18653/v1/W19-1418

Cristian Onose, Dumitru-Clementin Cercel, Stefan Trausan-Matu


Citations: 12

Abstract

This paper describes our models for the Moldavian vs. Romanian Cross-Dialect Topic Identification (MRC) evaluation campaign, part of the VarDial 2019 workshop. We focus on the three MRC subtasks: binary classification between the Moldavian (MD) and Romanian (RO) dialects, and two cross-dialect multi-class classification tasks over six news topics, MD to RO and RO to MD. We propose several deep learning models based on Long Short-Term Memory (LSTM) cells, Bidirectional Gated Recurrent Units (BiGRU) and Hierarchical Attention Networks (HAN). We also employ three word embedding models to represent the text as low-dimensional vectors. Our official submission includes two runs of the BiGRU and HAN models for each of the three subtasks. The best submitted model obtained the following macro-averaged F1 scores: 0.708 for subtask 1, 0.481 for subtask 2 and 0.480 for subtask 3. Due to a read error caused by the quoting behaviour when parsing the test file, our final submissions contained fewer items than expected, and more than 50% of the submission files were corrupted. We therefore also present the results obtained with the corrected labels, for which the HAN model achieves 0.930 for subtask 1, 0.590 for subtask 2 and 0.687 for subtask 3.
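The abstract does not say which reader or file layout was involved in the quoting error, so the following is only an illustrative sketch of how this class of problem can arise. It assumes a tab-separated file with one sample per line and uses Python's standard `csv` module; the sample IDs, texts and the helper `read_rows` are all hypothetical. With default quoting, a double quote at the start of a field opens a quoted field that can swallow the following lines, so the reader returns fewer rows than the file has lines.

```python
import csv
import io

# Hypothetical tab-separated test data: four samples, one per line.
# Sample id2 starts with a double quote and sample id3 ends with one,
# which is ordinary punctuation in news text, not CSV quoting.
RAW = (
    "id1\tprima stire\n"
    "id2\t\"a doua stire\n"
    "id3\ta treia stire\"\n"
    "id4\ta patra stire\n"
)


def read_rows(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return list(reader)


# Default behaviour: the quote in id2 opens a quoted field that only
# closes at the quote in id3, so two samples collapse into one row.
default_rows = read_rows(RAW, csv.QUOTE_MINIMAL)

# Quoting disabled: quotes are treated as literal characters and
# every line becomes exactly one row.
literal_rows = read_rows(RAW, csv.QUOTE_NONE)

print(len(default_rows), "rows with default quoting")
print(len(literal_rows), "rows with quoting disabled")
```

Comparing the parsed row count against the number of lines in the file is a cheap sanity check before writing predictions, since a shortfall of this kind would otherwise only show up as a submission with missing items.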
