The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts

DELTA Documentacao de Estudos em Linguistica Teorica e Aplicada (Q3, Social Sciences) · Pub Date: 2023-01-01 · DOI: 10.1590/1678-460x202339355159

Marcio Lima Inácio, Marco Antonio Sobrevilla Cabezudo, Renata Ramisch, Ariani Di Felippo, Thiago Alexandre Salgueiro Pardo

Citations: 2

Abstract

One of the most popular semantic representation languages in Natural Language Processing (NLP) is Abstract Meaning Representation (AMR). This formalism encodes the meaning of a single sentence as a rooted, directed graph. For English, there is a large annotated corpus that provides high-quality, reusable data for building new NLP methods and applications or improving existing ones. To build AMR corpora for languages other than English, including Brazilian Portuguese, both automatic and manual strategies have been used. The automatic annotation methods rely essentially on the cross-linguistic alignment of parallel corpora and the inheritance of the AMR annotation from the source language. The manual strategies focus on adapting the English AMR guidelines to the target language. Both annotation strategies have to deal with challenging linguistic phenomena. This paper explores in detail some characteristics of Portuguese for which the AMR model had to be adapted and introduces two annotated corpora: AMRNews, a corpus of 870 annotated sentences from journalistic texts, and OpiSums-PT-AMR, comprising 404 opinionated sentences annotated in AMR.
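To make the "rooted, directed graph" description concrete, here is a minimal Python sketch. It is an illustration only: the sentence ("The boy wants to go") is a common textbook-style AMR example, not one taken from AMRNews or OpiSums-PT-AMR, the sense numbers on the concepts are illustrative, and the helper names (`reachable_from`, `is_rooted`, `reentrant_nodes`) are hypothetical rather than part of any AMR toolkit.

```python
from collections import defaultdict

# Illustrative AMR for "The boy wants to go" (assumed example, not from the paper).
# In PENMAN notation: (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))
ROOT = "w"
INSTANCES = {"w": "want-01", "b": "boy", "g": "go-02"}  # variable -> concept
EDGES = [("w", ":ARG0", "b"), ("w", ":ARG1", "g"), ("g", ":ARG0", "b")]


def reachable_from(root, edges):
    """Return the set of nodes reachable from root by following directed edges."""
    adjacency = defaultdict(list)
    for source, _role, target in edges:
        adjacency[source].append(target)
    seen, stack = {root}, [root]
    while stack:
        node = stack.pop()
        for successor in adjacency[node]:
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def is_rooted(root, instances, edges):
    """A graph is rooted at root if every node can be reached from it."""
    return reachable_from(root, edges) == set(instances)


def reentrant_nodes(edges):
    """Nodes with more than one incoming edge (shared arguments)."""
    incoming = defaultdict(int)
    for _source, _role, target in edges:
        incoming[target] += 1
    return sorted(node for node, count in incoming.items() if count > 1)


if __name__ == "__main__":
    print(is_rooted(ROOT, INSTANCES, EDGES))
    print(reentrant_nodes(EDGES))
```

The variable `b` has two incoming edges because the boy is both the one who wants and the one who goes; this kind of reentrancy is why an AMR is a graph rather than a tree.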

Source journal

DELTA Documentacao de Estudos em Linguistica Teorica e Aplicada (Social Sciences, Linguistics and Language)

CiteScore: 0.40
Self-citation rate: 0.00%
Review time: 52 weeks

Journal description: The journal Documentação de Estudos em Lingüística Teórica e Aplicada (DELTA) is published by the Pontifícia Universidade Católica de São Paulo (PUC-SP). DELTA has been published since 1985, and in 1992 it became a biannual publication. Issues are published in February and August. The journal is addressed to all areas of study concerning language and speech, whether theoretical or applied; however, only unpublished contributions will be considered. For brief references to the journal, the short title DELTA is recommended in bibliographies, footnotes, bibliographical strips, and references.