A Benchmark for Neural Readability Assessment of Texts in Spanish

Laura Vásquez-Rodríguez, P. Cuenca-Jiménez, Sergio Morales-Esquivel, Fernando Alva-Manchego
{"title":"A Benchmark for Neural Readability Assessment of Texts in Spanish","authors":"Laura Vásquez-Rodríguez, P. Cuenca-Jiménez, Sergio Morales-Esquivel, Fernando Alva-Manchego","doi":"10.18653/v1/2022.tsar-1.18","DOIUrl":null,"url":null,"abstract":"We release a new benchmark for Automated Readability Assessment (ARA) of texts in Spanish. We combined existing corpora with suitable texts collected from the Web, thus creating the largest available dataset for ARA of Spanish texts. All data was pre-processed and categorised to allow experimenting with ARA models that make predictions at two (simple and complex) or three (basic, intermediate, and advanced) readability levels, and at two text granularities (paragraphs and sentences). An analysis based on readability indices shows that our proposed datasets groupings are suitable for their designated readability level. We use our benchmark to train neural ARA models based on BERT in zero-shot, few-shot, and cross-lingual settings. Results show that either a monolingual or multilingual pre-trained model can achieve good results when fine-tuned in language-specific data. In addition, all mod- els decrease their performance when predicting three classes instead of two, showing opportunities for the development of better ARA models for Spanish with existing resources.","PeriodicalId":247582,"journal":{"name":"Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2022.tsar-1.18","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

We release a new benchmark for Automated Readability Assessment (ARA) of texts in Spanish. We combined existing corpora with suitable texts collected from the Web, thus creating the largest available dataset for ARA of Spanish texts. All data was pre-processed and categorised to allow experimenting with ARA models that make predictions at two (simple and complex) or three (basic, intermediate, and advanced) readability levels, and at two text granularities (paragraphs and sentences). An analysis based on readability indices shows that our proposed datasets groupings are suitable for their designated readability level. We use our benchmark to train neural ARA models based on BERT in zero-shot, few-shot, and cross-lingual settings. Results show that either a monolingual or multilingual pre-trained model can achieve good results when fine-tuned in language-specific data. In addition, all mod- els decrease their performance when predicting three classes instead of two, showing opportunities for the development of better ARA models for Spanish with existing resources.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于神经网络的西班牙语文本可读性评价基准
我们发布了西班牙语文本自动可读性评估(ARA)的新基准。我们将现有的语料库与从网络上收集的合适文本结合起来,从而创建了西班牙语文本ARA的最大可用数据集。所有数据都经过预处理和分类,以允许ARA模型进行试验,该模型可以在两个(简单和复杂)或三个(基本,中级和高级)可读性级别以及两个文本粒度(段落和句子)上进行预测。基于可读性指标的分析表明,我们提出的数据集分组适合其指定的可读性水平。我们使用我们的基准在零镜头、少镜头和跨语言设置下训练基于BERT的神经ARA模型。结果表明,无论是单语言预训练模型还是多语言预训练模型,在特定语言的数据中进行微调后,都能取得良好的效果。此外,所有的模型在预测三个等级而不是两个等级时都会降低性能,这显示了利用现有资源开发更好的西班牙语ARA模型的机会。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Parallel Corpus Filtering for Japanese Text Simplification teamPN at TSAR-2022 Shared Task: Lexical Simplification using Multi-Level and Modular Approach CENTAL at TSAR-2022 Shared Task: How Does Context Impact BERT-Generated Substitutions for Lexical Simplification? A Benchmark for Neural Readability Assessment of Texts in Spanish A Dataset of Word-Complexity Judgements from Deaf and Hard-of-Hearing Adults for Text Simplification
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1