Marcio Lima Inácio, Marco Antonio Sobrevilla Cabezudo, Renata Ramisch, Ariani Di Felippo, Thiago Alexandre Salgueiro Pardo
Title: The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts
Journal: DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada
Publication date: 2023-01-01
DOI: 10.1590/1678-460x202339355159 (https://doi.org/10.1590/1678-460x202339355159)
Citations: 2
Abstract
Abstract Meaning Representation (AMR) is one of the most popular semantic representation languages in Natural Language Processing (NLP). The formalism encodes the meaning of a single sentence as a rooted directed graph. For English, there is a large annotated corpus that provides high-quality, reusable data for building new NLP methods and applications or improving existing ones. For non-English languages, including Brazilian Portuguese, AMR corpora have been built with both automatic and manual strategies. The automatic methods rely essentially on the cross-linguistic alignment of parallel corpora, with the target-language sentence inheriting the AMR annotation of its aligned counterpart. The manual strategies focus on adapting the English AMR guidelines to the target language. Both strategies have to deal with challenging linguistic phenomena. This paper explores in detail some characteristics of Portuguese for which the AMR model had to be adapted and introduces two annotated corpora: AMRNews, with 870 annotated sentences from journalistic texts, and OpiSums-PT-AMR, with 404 opinionated sentences annotated in AMR.
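To make the notion of a rooted directed graph concrete, here is a minimal sketch in Python. It is not taken from the paper or from either corpus: the sentence ("The boy wants to go") is a common English textbook example, the frame sense numbers are illustrative, and the helper names `reachable`, `is_rooted`, and `to_penman` are hypothetical. The graph is stored as labeled edges, checked for rootedness, and printed in a PENMAN-style bracketed form.

```python
from collections import defaultdict

# Illustrative AMR for the English sentence "The boy wants to go".
# Variables (w, b, g) are graph nodes. Concept labels and frame sense
# numbers are assumptions for illustration, not corpus data.
ROOT = "w"
INSTANCES = {"w": "want-01", "b": "boy", "g": "go-02"}
EDGES = [
    ("w", ":ARG0", "b"),  # the boy is the one who wants
    ("w", ":ARG1", "g"),  # what is wanted is the going
    ("g", ":ARG0", "b"),  # re-entrancy: the boy is also the one who goes
]


def reachable(root, edges):
    """Return the set of nodes reachable from root by following edges."""
    out = defaultdict(list)
    for src, _role, tgt in edges:
        out[src].append(tgt)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(out[node])
    return seen


def is_rooted(root, instances, edges):
    """A graph is rooted at root if every node is reachable from it."""
    return reachable(root, edges) == set(instances)


def to_penman(root, instances, edges):
    """Serialize depth-first; a revisited node is written as a bare variable."""
    out = defaultdict(list)
    for src, role, tgt in edges:
        out[src].append((role, tgt))
    visited = set()

    def render(node):
        if node in visited:
            return node
        visited.add(node)
        parts = [f"{node} / {instances[node]}"]
        for role, tgt in out[node]:
            parts.append(f"{role} {render(tgt)}")
        return "(" + " ".join(parts) + ")"

    return render(root)


if __name__ == "__main__":
    print(is_rooted(ROOT, INSTANCES, EDGES))
    print(to_penman(ROOT, INSTANCES, EDGES))
```

The node `b` has two incoming edges, which is why AMR is a graph rather than a tree: one participant can fill roles in several events of the same sentence.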
About the journal:
The journal Documentação de Estudos em Lingüística Teórica e Aplicada (DELTA) is published by the Pontifícia Universidade Católica de São Paulo (PUC-SP). DELTA has been published since 1985 and became a biannual publication in 1992, with issues appearing in February and August. The journal covers all areas of the study of language and speech, whether theoretical or applied; only unpublished contributions are considered. The short title DELTA is recommended when referring to the journal in bibliographies, footnotes, bibliographical strips, and references.