
Proceedings of the Sixth Workshop on: Latest Publications

DTeam @ VarDial 2019: Ensemble based on skip-gram and triplet loss neural networks for Moldavian vs. Romanian cross-dialect topic identification
DOI: 10.18653/v1/W19-1422
D. Tudoreanu
This paper presents the solution proposed by DTeam in the VarDial 2019 Evaluation Campaign for the Moldavian vs. Romanian cross-topic identification task. The solution proposed is a Support Vector Machines (SVM) ensemble composed of two character-level neural networks. The first network is a skip-gram classification model formed of an embedding layer, three convolutional layers and two fully-connected layers. The second network has a similar architecture, but is trained using the triplet loss function.
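The abstract only outlines the architecture; as a rough illustration, a character-level network with an embedding layer, three convolutional layers and two fully-connected layers, plus a triplet-loss training setup, could look like the PyTorch sketch below. All dimensions, filter widths and the margin are assumptions not stated in the paper, and the SVM ensemble on top is omitted.

```python
# Minimal PyTorch sketch of a character-level network of the kind the abstract
# describes: one embedding layer, three convolutional layers, two fully-connected
# layers. All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size=200, emb_dim=64, out_dim=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Two fully-connected layers; out_dim is the number of classes for the
        # classification network, or an embedding size for the triplet network.
        self.fc = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, char_ids):                      # char_ids: (batch, seq_len)
        x = self.embedding(char_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        x = self.convs(x).max(dim=2).values           # global max pooling over time
        return self.fc(x)

# The second network shares the architecture but is trained with a triplet loss
# on (anchor, positive, negative) samples, shown here with dummy character-id batches.
triplet_net = CharCNN(out_dim=64)
triplet_loss = nn.TripletMarginLoss(margin=1.0)
anchor, positive, negative = (torch.randint(1, 200, (8, 100)) for _ in range(3))
loss = triplet_loss(triplet_net(anchor), triplet_net(positive), triplet_net(negative))
```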
Citations: 15
Improving Cuneiform Language Identification with BERT
DOI: 10.18653/v1/W19-1402
Gabriel Bernier-Colborne, Cyril Goutte, Serge Léger
We describe the systems developed by the National Research Council Canada for the Cuneiform Language Identification (CLI) shared task at the 2019 VarDial evaluation campaign. We compare a state-of-the-art baseline relying on character n-grams and a traditional statistical classifier, a voting ensemble of classifiers, and a deep learning approach using a Transformer network. We describe how these systems were trained, and analyze the impact of some preprocessing and model estimation decisions. The deep neural network achieved 77% accuracy on the test data, which turned out to be the best performance at the CLI evaluation, establishing a new state-of-the-art for cuneiform language identification.
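For a rough sense of the character n-gram baseline mentioned above, the scikit-learn pipeline below trains a linear classifier over TF-IDF weighted character n-grams. The n-gram range, the classifier choice and the toy data are assumptions for illustration, not the NRC team's configuration.

```python
# Sketch of a character n-gram baseline with a traditional statistical classifier,
# of the kind the abstract describes. N-gram range, classifier and data are toy choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy stand-ins for transliterated cuneiform lines and their language labels
train_texts = ["lugal kalam ma", "szar mat akkadi ki"]
train_labels = ["Sumerian", "Akkadian"]

baseline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
    LinearSVC(),
)
baseline.fit(train_texts, train_labels)
print(baseline.predict(["lugal kalam"]))
```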
Citations: 20
Comparing Pipelined and Integrated Approaches to Dialectal Arabic Neural Machine Translation
DOI: 10.18653/v1/W19-1424
Pamela Shapiro, Kevin Duh
When translating diglossic languages such as Arabic, situations may arise where we would like to translate a text but do not know which dialect it is. A traditional approach to this problem is to design dialect identification systems and dialect-specific machine translation systems. However, under the recent paradigm of neural machine translation, shared multi-dialectal systems have become a natural alternative. Here we explore under which conditions it is beneficial to perform dialect identification for Arabic neural machine translation versus using a general system for all dialects.
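As a schematic view of the two setups contrasted here, the sketch below routes a text through a dialect identifier to a dialect-specific translator (pipelined) or sends everything to one shared system (integrated). The `identify_dialect` function and the translator objects are hypothetical placeholders, not code from the paper.

```python
# Schematic sketch contrasting the pipelined approach (dialect identification
# followed by a dialect-specific system) with an integrated shared system.
# `identify_dialect` and the translators are hypothetical placeholders.
from typing import Callable, Dict

def identify_dialect(text: str) -> str:
    """Hypothetical dialect identifier returning e.g. 'MSA', 'EGY' or 'LEV'."""
    return "MSA"  # placeholder decision

def make_translator(name: str) -> Callable[[str], str]:
    """Stand-in for loading an NMT model (dialect-specific or shared)."""
    return lambda text: f"[{name} translation of] {text}"

dialect_specific: Dict[str, Callable[[str], str]] = {
    d: make_translator(d) for d in ("MSA", "EGY", "LEV")
}
shared = make_translator("multi-dialect")

def translate_pipelined(text: str) -> str:
    # Pipelined: identify the dialect first, then use that dialect's system.
    return dialect_specific[identify_dialect(text)](text)

def translate_integrated(text: str) -> str:
    # Integrated: a single shared multi-dialectal system handles all input.
    return shared(text)

print(translate_pipelined("example sentence"))
print(translate_integrated("example sentence"))
```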
Citations: 4
Neural and Linear Pipeline Approaches to Cross-lingual Morphological Analysis
DOI: 10.18653/v1/W19-1416
Çagri Çöltekin, Jeremy Barnes
This paper describes the Tübingen-Oslo team's participation in the cross-lingual morphological analysis task in the VarDial 2019 evaluation campaign. We participated in the shared task with a standard neural network model. Our model achieved analysis F1-scores of 31.48 and 23.67 on the test languages Karachay-Balkar (Turkic) and Sardinian (Romance), respectively. The scores are comparable to the scores obtained by the other participants in both language families, and the analysis score on the Romance data set was also the best result obtained in the shared task. Besides describing the system used in our shared task participation, we describe another, simpler, model based on linear classifiers, and present further analyses using both models. Our analyses, besides revealing some of the difficult cases, also confirm that the usefulness of a source language in this task is highly correlated with the similarity of source and target languages.
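The simpler linear-classifier model is not detailed in the abstract; as one plausible shape of such a model, the sketch below predicts a morphological tag from character-suffix features with a linear classifier. The feature set, tag strings and toy words are assumptions for illustration only.

```python
# Rough sketch of a linear-classifier approach to morphological tag prediction,
# in the spirit of the simpler model the abstract mentions. The suffix features
# and the toy tag inventory are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def suffix_features(word: str, max_len: int = 4) -> dict:
    """Character-suffix features of the surface form (an assumed feature set)."""
    return {f"suffix_{k}": word[-k:] for k in range(1, max_len + 1) if len(word) >= k}

# toy training pairs: (surface form, morphological tag string)
train = [("kitaplar", "N;PL;NOM"), ("kitap", "N;SG;NOM"), ("evler", "N;PL;NOM")]
X = [suffix_features(w) for w, _ in train]
y = [tag for _, tag in train]

clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict([suffix_features("evlerde")]))
```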
Citations: 3
Initial Experiments In Cross-Lingual Morphological Analysis Using Morpheme Segmentation
DOI: 10.18653/v1/W19-1415
V. Mikhailov, Lorenzo Tosi, Anastasia Khorosheva, O. Serikov
The paper describes initial experiments in data-driven cross-lingual morphological analysis of open-category words using a combination of unsupervised morpheme segmentation, annotation projection and an LSTM encoder-decoder model with attention. Our algorithm provides lemmatisation and morphological analysis generation for previously unseen low-resource language surface forms with only annotated data on the related languages given. Despite the inherently lossy annotation projection, we achieved the best lemmatisation F1-score in the VarDial 2019 Shared Task on Cross-Lingual Morphological Analysis for both Karachay-Balkar (Turkic languages, agglutinative morphology) and Sardinian (Romance languages, fusional morphology).
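The LSTM encoder-decoder with attention named in the abstract can be sketched compactly in PyTorch. The dot-product attention variant, the character-level setup and all dimensions below are assumptions, not the authors' exact model.

```python
# Compact sketch of a character-level LSTM encoder-decoder with dot-product
# attention, the kind of model the abstract names for generating lemmas and
# morphological analyses. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Seq2SeqAttn(nn.Module):
    def __init__(self, vocab_size=100, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.tgt_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, src, tgt_in):
        enc_out, state = self.encoder(self.src_emb(src))        # (B, S, H)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)  # (B, T, H)
        # Dot-product attention of every decoder step over the encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))    # (B, T, S)
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)  # (B, T, H)
        return self.out(torch.cat([dec_out, context], dim=-1))  # (B, T, vocab)

# Dummy teacher-forced training step on random character ids.
model = Seq2SeqAttn()
src = torch.randint(1, 100, (4, 12))   # surface forms as character ids
tgt = torch.randint(1, 100, (4, 15))   # lemma + analysis sequence as character ids
logits = model(src, tgt[:, :-1])       # feed previous target characters
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 100), tgt[:, 1:].reshape(-1))
```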
Citations: 1
Variation between Different Discourse Types: Literate vs. Oral
DOI: 10.18653/v1/W19-1407
Katrin Ortmann, Stefanie Dipper
This paper deals with the automatic identification of literate and oral discourse in German texts. A range of linguistic features is selected and their role in distinguishing between literate- and oral-oriented registers is investigated, using a decision-tree classifier. It turns out that all of the investigated features are related in some way to oral conceptuality. Especially simple measures of complexity (average sentence and word length) are prominent indicators of oral and literate discourse. In addition, features of reference and deixis (realized by different types of pronouns) also prove to be very useful in determining the degree of orality of different registers.
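The classification setup described here is straightforward to sketch: compute a few of the named features (average sentence length, average word length, a pronoun count) and fit a decision tree. The tokenisation, the toy examples and the specific pronoun list below are illustrative assumptions.

```python
# Sketch of the kind of feature extraction plus decision-tree classification the
# abstract describes. Toy data, tokenisation and pronoun list are assumptions.
import re
from sklearn.tree import DecisionTreeClassifier

def features(text: str) -> list:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    avg_sent_len = len(words) / max(len(sentences), 1)   # average sentence length
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    personal_pron = sum(w.lower() in {"ich", "du", "wir", "ihr"} for w in words)
    return [avg_sent_len, avg_word_len, personal_pron]

# toy examples of oral- and literate-oriented German text
texts = ["Ich sag dir, das war echt gut!",
         "Die Untersuchung der sprachlichen Merkmale erfolgt mittels eines Entscheidungsbaums."]
labels = ["oral", "literate"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit([features(t) for t in texts], labels)
print(clf.predict([features("Wir machen das morgen, oder?")]))
```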
Citations: 6