Improving Cuneiform Language Identification with BERT

Proceedings of the Sixth Workshop on | Pub Date: 1900-01-01 | DOI: 10.18653/v1/W19-1402

Gabriel Bernier-Colborne, Cyril Goutte, Serge Léger


Citations: 20

Abstract

We describe the systems developed by the National Research Council Canada for the Cuneiform Language Identification (CLI) shared task at the 2019 VarDial evaluation campaign. We compare three approaches: a state-of-the-art baseline relying on character n-grams and a traditional statistical classifier, a voting ensemble of classifiers, and a deep learning approach using a Transformer network. We describe how these systems were trained and analyze the impact of some preprocessing and model estimation decisions. The deep neural network achieved 77% accuracy on the test data, which turned out to be the best performance at the CLI evaluation, establishing a new state of the art for cuneiform language identification.
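To make the first two approaches concrete, here is a minimal sketch of a character n-gram classifier and a voting ensemble, using only the Python standard library. It is not the authors' system. The abstract does not name the statistical classifier, so multinomial Naive Bayes with add-one smoothing is used as a stand-in. The n-gram ranges, the tie-breaking rule, the toy Latin-letter strings, and the labels "A" and "B" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (stand-in classifier)."""

    def __init__(self, n_min=1, n_max=3):
        self.n_min, self.n_max = n_min, n_max

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)      # label -> n-gram counts
        self.class_counts = Counter(labels)     # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text, self.n_min, self.n_max))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # log prior + sum of add-one-smoothed log likelihoods
            score = math.log(self.class_counts[label] / n_docs)
            denom = self.totals[label] + len(self.vocab)
            for g in char_ngrams(text, self.n_min, self.n_max):
                score += math.log((self.counts[label][g] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


def vote(models, text):
    """Plurality vote; ties go to the earliest model's prediction (assumed rule)."""
    preds = [m.predict(text) for m in models]
    tally = Counter(preds)
    top = max(tally.values())
    return next(p for p in preds if tally[p] == top)


# Toy data, invented for illustration only (not cuneiform, not from the task).
train_texts = ["abab abab", "abba abab", "xyxy yxyx", "xyyx xyxy"]
train_labels = ["A", "A", "B", "B"]

# Ensemble members differ only in their maximum n-gram length.
models = [NgramNaiveBayes(1, n).fit(train_texts, train_labels) for n in (1, 2, 3)]

print(vote(models, "abab"))  # A
print(vote(models, "yxyx"))  # B
```

In a real setting the ensemble members would usually differ in more than n-gram length (different features or classifier types), since voting helps most when the members make different errors.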

