GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing

AACL. Pub Date: 2022-10-19. DOI: 10.48550/arXiv.2210.10449

Siyao Peng, Yang Janet Liu, Amir Zeldes

Citations: 5

Abstract

A lack of large-scale human-annotated data has hampered hierarchical discourse parsing of Chinese. In this paper, we present GCDT, the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five genres of freely available text, using the same relation inventory as contemporary RST treebanks for English. We also report parsing experiments on this dataset, including state-of-the-art (SOTA) scores for Chinese RST parsing and for RST parsing on the English GUM dataset, obtained with cross-lingual training on Chinese and English using multilingual embeddings.

