Syntactic annotation for Portuguese corpora: standards, parsers, and search interfaces

IF 1.8 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Language Resources and Evaluation | Pub Date: 2023-12-26 | DOI: 10.1007/s10579-023-09699-4

Pablo Faria, Charlotte Galves, Catarina Magro

Citation count: 0

Abstract

In the last two decades, four syntactically annotated corpora of Portuguese were built along the lines initially defined for the Penn Parsed Historical Corpora (Santorini, 2016). They cover the old, middle, classical, and modern periods of European Portuguese, as well as nineteenth- and twentieth-century Brazilian Portuguese, and include different textual genres and excerpts of oral discourse. Together they provide a fundamental resource for the study of variation and change in Portuguese. In recent years, an effort was made to unify as far as possible the annotation scheme applied to those corpora, so that a search run on one corpus can be run in exactly the same way on the others. This effort resulted in the Portuguese Syntactic Annotation Manual (Magro & Galves, 2019). In this paper, we present the syntactic annotation for the Portuguese Corpora. We describe how ParsPort works: it is a rule-based parser that makes use of the revision mode of the query language Corpus Search (Randall, 2005–2015). We argue that ParsPort serves our annotation efforts more efficiently than the probabilistic parser developed by Bikel (2004), which was previously used for the syntactic annotation of the Portuguese Corpora. Finally, we mention recent advances towards more user-friendly tools for syntactic searches.
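As a rough illustration of what rule-based revision of bracketed trees involves, the following minimal Python sketch parses a Penn-style bracketed tree, applies one relabeling rule, and runs a label search. It is not ParsPort and does not use Corpus Search query syntax. The function names, the sample sentence, the labels (`IP-MAT`, `NP`, `NP-SBJ`, `VB-D`, `D`, `N`), and the rule itself are assumptions chosen for illustration, not taken from the Portuguese Syntactic Annotation Manual.

```python
# Illustrative sketch only: labels, sample sentence, and rule are assumptions,
# not the ParsPort rules or the Corpus Search query language.

def parse_tree(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]          # first token after "(" is the label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())   # nested constituent
            else:
                node.append(tokens[pos])  # word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return node

    return read()


def to_string(node):
    """Serialize a nested-list tree back to bracketed notation."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(to_string(child) for child in node) + ")"


def find(node, label):
    """Yield every subtree whose label equals `label`, depth-first."""
    if isinstance(node, list):
        if node[0] == label:
            yield node
        for child in node[1:]:
            yield from find(child, label)


def label_subject(tree):
    """Hypothetical rule: in each IP-MAT, relabel a bare NP that occurs
    before the first verb as NP-SBJ."""
    for clause in find(tree, "IP-MAT"):
        for child in clause[1:]:
            if not isinstance(child, list):
                continue
            if child[0].startswith("VB"):
                break                 # reached the verb without finding an NP
            if child[0] == "NP":
                child[0] = "NP-SBJ"
                break
    return tree


SAMPLE = "(IP-MAT (NP (D O) (N menino)) (VB-D viu) (NP (D o) (N gato)))"
revised = to_string(label_subject(parse_tree(SAMPLE)))
remaining_bare_nps = [to_string(n) for n in find(parse_tree(revised), "NP")]
print(revised)
print(remaining_bare_nps)
```

Because every tree is reduced to the same nested-list shape, the same `find` call would work unchanged on any corpus that shares the label inventory, which is the practical benefit of a unified annotation scheme described above.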

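Source journal

Language Resources and Evaluation (Engineering & Technology; Computer Science: Interdisciplinary Applications)

CiteScore: 6.50
Self-citation rate: 3.70%
Publication count: not listed
Review time: >12 weeks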

Journal description: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.