IrekiaLFes:西班牙语自动文本简化的一个新的开放基准和基线系统

Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022) Pub Date : 1900-01-01 DOI:10.18653/v1/2022.tsar-1.8

Itziar Gonzalez-Dios, Iker Gutiérrez-Fandiño, Oscar M. Cumbicus-Pineda, Aitor Soroa Etxabe

{"title":"IrekiaLFes:西班牙语自动文本简化的一个新的开放基准和基线系统","authors":"Itziar Gonzalez-Dios, Iker Gutiérrez-Fandiño, Oscar M. Cumbicus-Pineda, Aitor Soroa Etxabe","doi":"10.18653/v1/2022.tsar-1.8","DOIUrl":null,"url":null,"abstract":"Automatic Text simplification (ATS) seeks to reduce the complexity of a text for a general public or a target audience. In the last years, deep learning methods have become the most used systems in ATS research, but these systems need large and good quality datasets to be evaluated. Moreover, these data are available on a large scale only for English and in some cases with restrictive licenses. In this paper, we present IrekiaLF_es, an open-license benchmark for Spanish text simplification. It consists of a document-level corpus and a sentence-level test set that has been manually aligned. We also conduct a neurolinguistically-based evaluation of the corpus in order to reveal its suitability for text simplification. This evaluation follows the Lexicon-Unification-Linearity (LeULi) model of neurolinguistic complexity assessment. Finally, we present a set of experiments and baselines of ATS systems in a zero-shot scenario.","PeriodicalId":247582,"journal":{"name":"Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"IrekiaLFes: a New Open Benchmark and Baseline Systems for Spanish Automatic Text Simplification\",\"authors\":\"Itziar Gonzalez-Dios, Iker Gutiérrez-Fandiño, Oscar M. Cumbicus-Pineda, Aitor Soroa Etxabe\",\"doi\":\"10.18653/v1/2022.tsar-1.8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic Text simplification (ATS) seeks to reduce the complexity of a text for a general public or a target audience. In the last years, deep learning methods have become the most used systems in ATS research, but these systems need large and good quality datasets to be evaluated. Moreover, these data are available on a large scale only for English and in some cases with restrictive licenses. In this paper, we present IrekiaLF_es, an open-license benchmark for Spanish text simplification. It consists of a document-level corpus and a sentence-level test set that has been manually aligned. We also conduct a neurolinguistically-based evaluation of the corpus in order to reveal its suitability for text simplification. This evaluation follows the Lexicon-Unification-Linearity (LeULi) model of neurolinguistic complexity assessment. Finally, we present a set of experiments and baselines of ATS systems in a zero-shot scenario.\",\"PeriodicalId\":247582,\"journal\":{\"name\":\"Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2022.tsar-1.8\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2022.tsar-1.8","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

自动文本简化(ATS)旨在为公众或目标受众降低文本的复杂性。在过去的几年里，深度学习方法已经成为ATS研究中使用最多的系统，但是这些系统需要大量高质量的数据集来进行评估。此外，这些数据只能大规模地用于英语，在某些情况下还需要限制性许可。在本文中，我们提出了IrekiaLF_es，这是一个西班牙语文本简化的开放许可基准。它由文档级语料库和手动对齐的句子级测试集组成。我们还对语料库进行了基于神经语言学的评估，以揭示其对文本简化的适用性。该评估遵循神经语言复杂性评估的词典-统一-线性(LeULi)模型。最后，我们给出了一组零射击场景下ATS系统的实验和基线。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

IrekiaLFes: a New Open Benchmark and Baseline Systems for Spanish Automatic Text Simplification

Automatic Text simplification (ATS) seeks to reduce the complexity of a text for a general public or a target audience. In the last years, deep learning methods have become the most used systems in ATS research, but these systems need large and good quality datasets to be evaluated. Moreover, these data are available on a large scale only for English and in some cases with restrictive licenses. In this paper, we present IrekiaLF_es, an open-license benchmark for Spanish text simplification. It consists of a document-level corpus and a sentence-level test set that has been manually aligned. We also conduct a neurolinguistically-based evaluation of the corpus in order to reveal its suitability for text simplification. This evaluation follows the Lexicon-Unification-Linearity (LeULi) model of neurolinguistic complexity assessment. Finally, we present a set of experiments and baselines of ATS systems in a zero-shot scenario.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

自引率

0.00%

发文量