
Workshop on NLP for Similar Languages, Varieties and Dialects: Latest Publications

When Sparse Traditional Models Outperform Dense Neural Networks: the Curious Case of Discriminating between Similar Languages
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1219
M. Medvedeva, Martin Kroon, Barbara Plank
We present the results of our participation in the VarDial 4 shared task on discriminating closely related languages. Our submission includes simple traditional models using linear support vector machines (SVMs) and a neural network (NN). The main idea was to leverage language group information. We did so with a two-layer approach in the traditional model and a multi-task objective in the neural network case. Our results confirm earlier findings: simple traditional models outperform neural networks consistently for this task, at least given the number of systems we could examine in the available time. Our two-layer linear SVM ranked 2nd in the shared task.
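The two-layer idea — first predict the language group, then hand the text to a classifier specialized for that group — can be sketched with scikit-learn. The toy sentences, language codes, and group names below are invented for illustration; the authors' actual features and training data differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus of (sentence, language, language group) -- all invented.
data = [
    ("como estas hoy amigo mio",         "es", "ibero"),
    ("como vai voce hoje meu amigo",     "pt", "ibero"),
    ("hvordan har du det i dag min ven", "da", "nordic"),
    ("hvordan gar det med deg i dag",    "no", "nordic"),
]
texts  = [t for t, _, _ in data]
groups = [g for _, _, g in data]

def char_ngram_svm():
    # Character n-gram TF-IDF + linear SVM: the sparse traditional recipe.
    return make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 4)), LinearSVC())

# Layer 1: predict the language group.
group_clf = char_ngram_svm().fit(texts, groups)

# Layer 2: one specialized classifier per group, trained only on that group.
lang_clf = {
    g: char_ngram_svm().fit([t for t, _, gg in data if gg == g],
                            [l for _, l, gg in data if gg == g])
    for g in set(groups)
}

def predict(text):
    group = group_clf.predict([text])[0]
    return lang_clf[group].predict([text])[0]
```

Routing through the group classifier first means each second-layer model only has to separate a handful of genuinely similar varieties, which is the hard part of the task.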
Citations: 40
Cross-lingual dependency parsing for closely related languages - Helsinki’s submission to VarDial 2017
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1216
J. Tiedemann
This paper describes the submission from the University of Helsinki to the shared task on cross-lingual dependency parsing at VarDial 2017. We present work on annotation projection and treebank translation that gave good results for all three target languages in the test set. In particular, Slovak seems to work well with information coming from the Czech treebank, which is in line with related work. The attachment scores for cross-lingual models even surpass the fully supervised models trained on the target language treebank. Croatian is the most difficult language in the test set and the improvements over the baseline are rather modest. Norwegian works best with information coming from Swedish whereas Danish contributes surprisingly little.
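One ingredient of this approach, annotation projection, can be sketched in a few lines: POS tags and dependency heads are copied from a parsed source sentence to a target sentence through a word alignment. The Czech-to-Slovak example and the one-to-one alignment below are invented; real projection must also handle unaligned tokens and many-to-many links.

```python
# Source tokens: (form, POS, head index within the sentence; -1 marks the root).
src = [("velky", "ADJ", 1), ("pes", "NOUN", 2), ("steka", "VERB", -1)]
align = {0: 1, 1: 0, 2: 2}       # src index -> tgt index (word order differs)
tgt_forms = ["pes", "velky", "steka"]

def project(src, align, tgt_forms):
    """Copy POS and head from the aligned source token; remap heads via align."""
    inv = {j: i for i, j in align.items()}
    out = []
    for j, form in enumerate(tgt_forms):
        i = inv[j]                       # source token aligned to target j
        _, pos, head = src[i]
        out.append((form, pos, align[head] if head != -1 else -1))
    return out

projected = project(src, align, tgt_forms)
# The verb stays the root; the noun and adjective heads are remapped
# to the target-side positions of their source governors.
```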
Citations: 10
Arabic Dialect Identification Using iVectors and ASR Transcripts
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1222
S. Malmasi, Marcos Zampieri
This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017. The goal of the task is to evaluate computational models to identify the dialect of Arabic utterances using both audio and text transcriptions. The ADI shared task dataset included Modern Standard Arabic (MSA) and four Arabic dialects: Egyptian, Gulf, Levantine, and North-African. The three systems submitted by MAZA are based on combinations of multiple machine learning classifiers arranged as (1) voting ensemble; (2) mean probability ensemble; (3) meta-classifier. The best results were obtained by the meta-classifier achieving 71.7% accuracy, ranking second among the six teams which participated in the ADI shared task.
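The two simpler combination strategies can be illustrated directly; the probability distributions below are invented numbers for one hypothetical utterance, with labels following the ADI dialect inventory.

```python
LABELS = ["MSA", "EGY", "GLF", "LAV", "NOR"]

# Per-classifier probability distributions for one utterance (made up).
probs = [
    {"MSA": 0.10, "EGY": 0.50, "GLF": 0.20, "LAV": 0.15, "NOR": 0.05},
    {"MSA": 0.05, "EGY": 0.30, "GLF": 0.40, "LAV": 0.15, "NOR": 0.10},
    {"MSA": 0.15, "EGY": 0.45, "GLF": 0.20, "LAV": 0.10, "NOR": 0.10},
]

def plurality_vote(probs):
    """(1) Voting ensemble: each classifier casts one vote for its argmax."""
    votes = [max(p, key=p.get) for p in probs]
    return max(set(votes), key=votes.count)

def mean_probability(probs):
    """(2) Mean-probability ensemble: average the distributions, then argmax."""
    mean = {lab: sum(p[lab] for p in probs) / len(probs) for lab in LABELS}
    return max(mean, key=mean.get)

# (3) A meta-classifier would instead feed these per-classifier probabilities
# as features to a second-stage learner trained on held-out predictions.

print(plurality_vote(probs))    # EGY: two of the three classifiers prefer it
print(mean_probability(probs))  # EGY: highest average probability
```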
Citations: 32
Identifying dialects with textual and acoustic cues
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1211
Abualsoud Hanani, Aziz Qaroush, Stephen Eugene Taylor
We describe several systems for identifying short samples of Arabic or Swiss-German dialects, which were prepared for the shared task of the 2017 DSL Workshop (Zampieri et al., 2017). The Arabic data comprises both text and acoustic files, and our best run combined both. The Swiss-German data is text-only. Coincidentally, our best runs achieved an accuracy of nearly 63% on both the Swiss-German and Arabic dialects tasks.
Citations: 14
Discriminating between Similar Languages with Word-level Convolutional Neural Networks
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1215
Marcelo Criscuolo, S. Aluísio
Discriminating between Similar Languages (DSL) is a challenging task addressed at the VarDial Workshop series. We report on our participation in the DSL shared task with a two-stage system. In the first stage, character n-grams are used to separate language groups, then specialized classifiers distinguish similar language varieties. We have conducted experiments with three system configurations and submitted one run for each. Our main approach is a word-level convolutional neural network (CNN) that learns task-specific vectors with minimal text preprocessing. We also experiment with multi-layer perceptron (MLP) networks and another hybrid configuration. Our best run achieved an accuracy of 90.76%, ranking 8th among 11 participants and getting very close to the system that ranked first (less than 2 points). Even though the CNN model could not achieve the best results, it still makes a viable approach to discriminating between similar languages.
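The word-level CNN building block — word vectors, a 1-D convolution over word windows, then max-over-time pooling — can be sketched in plain NumPy. All sizes, weights, and the tiny vocabulary below are invented for illustration; the paper's network additionally learns its parameters end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"o": 0, "gato": 1, "come": 2, "peixe": 3, "el": 4}
emb_dim, n_filters, width = 8, 4, 2

E = rng.normal(size=(len(vocab), emb_dim))         # embedding table
W = rng.normal(size=(n_filters, width * emb_dim))  # convolution filters
b = np.zeros(n_filters)

def cnn_features(words):
    x = E[[vocab[w] for w in words]]               # (seq_len, emb_dim)
    # Slide a window of `width` words and apply all filters to each window.
    windows = np.stack([x[i:i + width].ravel()
                        for i in range(len(words) - width + 1)])
    conv = np.maximum(0, windows @ W.T + b)        # ReLU, (n_windows, n_filters)
    return conv.max(axis=0)                        # max-over-time pooling

feats = cnn_features(["o", "gato", "come", "peixe"])
print(feats.shape)  # (4,): one pooled activation per filter
```

Max-over-time pooling makes the feature vector length independent of sentence length, which is what lets a fixed-size classifier sit on top.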
Citations: 9
Multi-source morphosyntactic tagging for spoken Rusyn
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1210
Yves Scherrer, Achim Rabus
This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as tagging toolkit, we show that a tagger trained on a balanced set of the four source languages outperforms single language taggers by about 9%, and that additional automatically induced morphosyntactic lexicons lead to further improvements. The best observed accuracies for Rusyn are 82.4% for part-of-speech tagging and 75.5% for full morphological tagging.
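The multi-source idea can be sketched with a toy tagger: pool tagged sentences from several related source languages into one training set instead of training per-language taggers. The data below is invented, and a most-frequent-tag lookup stands in for MarMoT's CRF-based model.

```python
from collections import Counter, defaultdict

# Invented one-sentence "corpora" for the four source languages.
corpora = {
    "ru": [[("dom", "NOUN"), ("bolshoy", "ADJ")]],
    "uk": [[("dim", "NOUN"), ("velykyy", "ADJ")]],
    "sk": [[("dom", "NOUN"), ("velky", "ADJ")]],
    "pl": [[("dom", "NOUN"), ("duzy", "ADJ")]],
}

def train(sentences):
    """Most-frequent-tag baseline: map each word form to its commonest tag."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Balanced pooling of all four source languages into one training set.
pooled = [sent for sents in corpora.values() for sent in sents]
tagger = train(pooled)
print(tagger["dom"])  # NOUN: evidence from three of the four sources
```

The payoff of pooling is coverage: a Rusyn form unseen in one source language is often attested, with the right tag, in an etymologically close neighbour.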
Citations: 3
German Dialect Identification in Interview Transcriptions
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1220
S. Malmasi, Marcos Zampieri
This paper presents three systems submitted to the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2017. The task consists of training models to identify the dialect of Swiss-German speech transcripts. The dialects included in the GDI dataset are Basel, Bern, Lucerne, and Zurich. The three systems we submitted are based on: a plurality ensemble, a mean probability ensemble, and a meta-classifier trained on character and word n-grams. The best results were obtained by the meta-classifier achieving 68.1% accuracy and 66.2% F1-score, ranking first among the 10 teams which participated in the GDI shared task.
Citations: 28
Why Catalan-Spanish Neural Machine Translation? Analysis, comparison and combination with standard Rule and Phrase-based technologies
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1207
M. Costa-jussà
Catalan and Spanish are two related languages given that both derive from Latin. They share similarities in several linguistic levels including morphology, syntax and semantics. This makes them particularly interesting for the MT task. Given the recent appearance and popularity of neural MT, this paper analyzes the performance of this new approach compared to the well-established rule-based and phrase-based MT systems. Experiments are reported on a large database of 180 million words. Results, in terms of standard automatic measures, show that neural MT clearly outperforms the rule-based and phrase-based MT systems on the in-domain test set, but it is worst on the out-of-domain test set. A naive system combination especially works for the latter. In-domain manual analysis shows that neural MT tends to improve both adequacy and fluency, for example, by being able to generate more natural translations instead of literal ones, choosing the adequate target word when the source word has several translations and improving gender agreement. However, out-of-domain manual analysis shows how neural MT is more affected by unknown words or contexts.
Citations: 22
Slavic Forest, Norwegian Wood
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1226
Rudolf Rosa, Daniel Zeman, D. Mareček, Z. Žabokrtský
We once had a corp, or should we say, it once had us
They showed us its tags, isn’t it great, unified tags
They asked us to parse and they told us to use everything
So we looked around and we noticed there was near nothing
We took other langs, bitext aligned: words one-to-one
We played for two weeks, and then they said, here is the test
The parser kept training till morning, just until deadline
So we had to wait and hope what we get would be just fine
And, when we awoke, the results were done, we saw we’d won
So, we wrote this paper, isn’t it good, Norwegian wood.
Citations: 10
Author Profiling at PAN: from Age and Gender Identification to Language Variety Identification (invited talk)
Pub Date : 2017-04-01 DOI: 10.18653/v1/W17-1205
Paolo Rosso
Author profiling is the study of how language is shared by people. It is a problem of growing importance in applications dealing with security, in order to understand who could be behind an anonymous threat message, and in marketing, where companies may be interested in knowing the demographics of the people who liked or disliked their products in online reviews. In this talk we give an overview of the PAN shared tasks that have been organised since 2013 at the CLEF and FIRE evaluation forums, mainly on age and gender identification in social media, although personality recognition in Twitter and in code sources was also addressed. In 2017 the PAN author profiling shared task jointly addresses gender and language variety identification in Twitter, where tweets have been annotated with the authors’ gender and the specific variety of their native language: English (Australia, Canada, Great Britain, Ireland, New Zealand, United States), Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela), Portuguese (Brazil, Portugal), and Arabic (Egypt, Gulf, Levantine, Maghrebi).
Citations: 0