Shallow discourse parsing for German

Dissertations in Artificial Intelligence. Pub Date: 2021-06-22. DOI: 10.25932/PUBLISHUP-50663

Peter Bourgonje


Citations: 2

Abstract

While the last few decades have seen impressive improvements in several areas of Natural Language Processing, asking a computer to make sense of the discourse of utterances in a text remains challenging. Several theories aim to describe and analyse the coherent structure that a well-written text exhibits. These theories vary in their applicability and feasibility for practical use. Presumably the most data-driven of them is the paradigm that comes with the Penn Discourse TreeBank, a corpus of over 1 million words annotated for discourse relations. Any language other than English, however, can be considered a low-resource language when it comes to discourse processing.

This dissertation is about shallow discourse parsing (discourse parsing following the paradigm of the Penn Discourse TreeBank) for German. The limited availability of annotated data for German also limits the potential of modern, deep-learning-based methods that rely on such data. This dissertation explores to what extent machine-learning and more recent deep-learning-based methods can be combined with traditional, linguistic feature engineering to improve performance on the discourse parsing task. A pivotal role is played by connective lexicons, which exhaustively list the discourse connectives of a particular language along with some of their core properties.

To facilitate training and evaluation of the methods proposed in this dissertation, an existing corpus (the Potsdam Commentary Corpus) has been extended and additional data has been annotated from scratch. The approach to end-to-end shallow discourse parsing for German adopts a pipeline architecture. For each individual sub-task of discourse parsing (in processing order: connective identification, argument extraction and sense classification), it either presents the first results for German or improves over the state of the art for German. The end-to-end shallow discourse parser for German developed for this dissertation is open-source and available online.

In the course of writing this dissertation, work has been carried out on several connective lexicons in different languages. Because of their central role and demonstrated usefulness for the methods proposed here, the dissertation discusses strategies for creating or further developing such lexicons for a particular language, and offers suggestions on how to further increase their usefulness for shallow discourse parsing.
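
The three-stage pipeline named in the abstract (connective identification, argument extraction, sense classification) can be illustrated with a minimal Python sketch. This is not the dissertation's parser: the lexicon entries, the sense labels, the heuristics and all function names below are hypothetical and chosen only to show how a connective lexicon can drive each stage in processing order.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: connective -> candidate senses (PDTB-style labels).
# A real connective lexicon lists far more entries and properties.
TOY_LEXICON = {
    "weil": ["Contingency.Cause"],
    "aber": ["Comparison.Contrast"],
    "während": ["Temporal.Synchronous", "Comparison.Contrast"],
}


@dataclass
class Relation:
    connective: str
    arg1: str
    arg2: str
    sense: str


def _normalize(token):
    """Lower-case a token and drop attached punctuation for lexicon lookup."""
    return token.strip(".,;:!?").lower()


def identify_connectives(tokens):
    """Stage 1: return indices of tokens that match a lexicon entry.

    Assumption: every match is a discourse connective. In practice many
    candidates are ambiguous, so this stage is usually a classifier.
    """
    return [i for i, tok in enumerate(tokens) if _normalize(tok) in TOY_LEXICON]


def extract_arguments(tokens, idx):
    """Stage 2: naive heuristic, text before the connective is arg1, text after is arg2."""
    arg1 = " ".join(tokens[:idx]).strip(" ,")
    arg2 = " ".join(tokens[idx + 1:]).strip(" .")
    return arg1, arg2


def classify_sense(connective):
    """Stage 3: pick the first sense listed for the connective (a crude baseline)."""
    return TOY_LEXICON[connective][0]


def parse(sentence):
    """Run the three stages in processing order on a whitespace-tokenized sentence."""
    tokens = sentence.split()
    relations = []
    for idx in identify_connectives(tokens):
        connective = _normalize(tokens[idx])
        arg1, arg2 = extract_arguments(tokens, idx)
        relations.append(Relation(connective, arg1, arg2, classify_sense(connective)))
    return relations


if __name__ == "__main__":
    for relation in parse("Er blieb zu Hause, weil er krank war."):
        print(relation)
```

In this sketch each stage consumes the output of the previous one, so an error in connective identification propagates to the later stages; that is the usual trade-off of a pipeline design compared with a joint model.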
