Toward a shallow discourse parser for Turkish

IF 2.3 · CAS Tier 3 (Computer Science) · JCR Q3 (Computer Science, Artificial Intelligence) · Natural Language Engineering · Pub Date: 2023-08-11 · DOI: 10.1017/s1351324923000359
Ferhat Kutlu, Deniz Zeyrek, Murathan Kurfali
Citations: 0

Abstract

One of the most interesting aspects of natural language is how texts cohere, which involves the pragmatic or semantic relations that hold between clauses (addition, cause-effect, conditional, similarity), referred to as discourse relations. Identifying and classifying discourse relations is a key challenge in supporting tasks such as text summarization, dialogue systems, and machine translation, which need information above the clause level. Despite recent interest in discourse relations in well-studied languages such as English, data and experiments are still needed for typologically different and less-resourced languages. We report the most comprehensive investigation of shallow discourse parsing in Turkish to date, focusing on two main sub-tasks: identification of discourse relation realization types and sense classification of explicit and implicit relations. The work is based on fine-tuning a pre-trained language model (BERT) as an encoder and classifying the encoded data with neural network-based classifiers. We first identify the discourse relation realization type that holds in a given text, if any, and then move on to sense classification of the identified explicit and implicit relations. In addition to in-domain experiments on a held-out test set from the Turkish Discourse Bank (TDB 1.2), we also report the out-of-domain performance of our models on the Turkish part of the TED Multilingual Discourse Bank in order to evaluate their generalization abilities. Finally, we explore the effect of multilingual data aggregation on the classification of relation realization type through a cross-lingual experiment. The results suggest that our models perform relatively well despite the limited size of the TDB 1.2, and that there are language-specific aspects of detecting the types of discourse relation realization. We believe the findings are important both in providing insights into the performance of modern language models in a typologically different language and in the low-resource scenario, given that the TDB 1.2 is 1/20th the size of the Penn Discourse TreeBank in terms of the total number of relations.
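The abstract describes a two-stage cascade: first classify the realization type of a candidate relation, then run sense classification only on explicit and implicit relations. The control flow of such a cascade can be sketched as below. This is an illustrative stand-in, not the authors' code: the real system encodes text with a fine-tuned BERT and uses neural classifiers, whereas here the encoder and both classifiers are hypothetical keyword-based stubs, and the label sets are simplified.

```python
# Illustrative sketch of a two-stage shallow discourse parsing cascade.
# NOTE: encode(), classify_realization(), and classify_sense() are toy
# stand-ins; in the paper these are a fine-tuned BERT encoder and
# neural network-based classifiers trained on TDB 1.2.

REALIZATION_TYPES = ["Explicit", "Implicit", "AltLex", "EntRel", "NoRel"]

def encode(text):
    """Stand-in for the BERT encoder: a single connective feature."""
    connectives = ("because", "but", "however", "although")
    return {"has_connective": any(c in text.lower() for c in connectives)}

def classify_realization(features):
    """Stage 1: identify the relation realization type, if any."""
    return "Explicit" if features["has_connective"] else "Implicit"

def classify_sense(features, realization):
    """Stage 2: sense classification, run only for explicit/implicit."""
    if realization not in ("Explicit", "Implicit"):
        return None
    return "Contingency.Cause" if features["has_connective"] else "Expansion"

def parse(text):
    feats = encode(text)
    realization = classify_realization(feats)
    return realization, classify_sense(feats, realization)

print(parse("The match was cancelled because it rained."))
# → ('Explicit', 'Contingency.Cause')
```

The key design point the sketch preserves is that stage 2 is conditioned on stage 1's output: sense labels are only predicted when the realization type admits a sense (explicit or implicit), mirroring the pipeline order described in the abstract.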
Source Journal

Natural Language Engineering (Computer Science, Artificial Intelligence)

CiteScore: 5.90
Self-citation rate: 12.00%
Articles per year: 60
Review turnaround: >12 weeks
Journal description: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.
Latest articles in this journal

Start-up activity in the LLM ecosystem
Anisotropic span embeddings and the negative impact of higher-order inference for coreference resolution: An empirical analysis
Automated annotation of parallel bible corpora with cross-lingual semantic concordance
How do control tokens affect natural language generation tasks like text simplification
Emerging trends: When can users trust GPT, and when should they intervene?