
Latest Publications: Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

PolyU-CBS at TSAR-2022 Shared Task: A Simple, Rank-Based Method for Complex Word Substitution in Two Steps
Emmanuele Chersoni, Yu-Yin Hsu
In this paper, we describe the system we presented at the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022) regarding the shared task on Lexical Simplification for English, Portuguese, and Spanish. We proposed an unsupervised approach in two steps: First, we used a masked language model with word masking for each language to extract possible candidates for the replacement of a difficult word; second, we ranked the candidates according to three different Transformer-based metrics. Finally, we determined our list of candidates based on the lowest average rank across different metrics.
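The paper's second step — averaging a candidate's rank across several metrics and keeping the lowest mean — can be illustrated with a minimal sketch. This is not the authors' code; the candidate words and metric scores below are invented for the example:

```python
# Hypothetical sketch of rank aggregation: each metric ranks the
# candidates, and candidates are ordered by mean rank (lower = better).

def average_rank(candidates, metric_scores):
    """metric_scores: list of dicts mapping candidate -> score, where a
    higher score is better under that metric. Returns candidates sorted
    by their mean rank across all metrics."""
    mean_ranks = {}
    for cand in candidates:
        ranks = []
        for scores in metric_scores:
            # rank 1 = best score under this metric
            ordered = sorted(candidates, key=lambda c: scores[c], reverse=True)
            ranks.append(ordered.index(cand) + 1)
        mean_ranks[cand] = sum(ranks) / len(ranks)
    return sorted(candidates, key=lambda c: mean_ranks[c])

candidates = ["big", "large", "huge"]
metric_scores = [
    {"big": 0.9, "large": 0.7, "huge": 0.4},  # e.g. LM probability
    {"big": 0.6, "large": 0.8, "huge": 0.3},  # e.g. similarity to the target word
    {"big": 0.7, "large": 0.5, "huge": 0.6},  # e.g. fluency of the rewritten sentence
]
print(average_rank(candidates, metric_scores))  # "big" wins with mean rank 1.33
```

The three metrics named in the comments stand in for the paper's three Transformer-based metrics, whose exact definitions are given in the paper itself.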
Citations: 3
Lexical Simplification in Foreign Language Learning: Creating Pedagogically Suitable Simplified Example Sentences
J. Degraeuwe, Horacio Saggion
This study presents a lexical simplification (LS) methodology for foreign language (FL) learning purposes, a barely explored area of automatic text simplification (TS). The method, targeted at Spanish as a foreign language (SFL), includes a customised complex word identification (CWI) classifier and generates substitutions based on masked language modelling. Performance is calculated on a custom dataset by means of a new, pedagogically-oriented evaluation. With 43% of the top simplifications being found suitable, the method shows potential for simplifying sentences to be used in FL learning activities. The evaluation also suggests that, though still crucial, meaning preservation is not always a prerequisite for successful LS. To arrive at grammatically correct and more idiomatic simplifications, future research could study the integration of association measures based on co-occurrence data.
Citations: 0
Parallel Corpus Filtering for Japanese Text Simplification
Koki Hatagaki, Tomoyuki Kajiwara, Takashi Ninomiya
We propose a method of parallel corpus filtering for Japanese text simplification. The parallel corpus for this task contains some redundant wording. In this study, we first identify the type and size of noisy sentence pairs in the Japanese text simplification corpus. We then propose a method of parallel corpus filtering to remove each type of noisy sentence pair. Experimental results show that filtering the training parallel corpus with the proposed method improves simplification performance.
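The idea of filtering noisy sentence pairs from a simplification corpus can be sketched generically. The heuristics below (dropping identical pairs and pairs whose "simple" side is much longer) are illustrative assumptions, not the noise types the paper actually identifies:

```python
# Illustrative parallel-corpus filter for (complex, simple) sentence pairs.
# The two heuristics here are generic examples of noise removal.

def filter_pairs(pairs, max_len_ratio=1.5):
    """Keep pairs that are non-identical and whose simple side is not
    drastically longer (in characters) than the complex side."""
    kept = []
    for complex_sent, simple_sent in pairs:
        if complex_sent == simple_sent:
            continue  # no simplification happened
        if len(simple_sent) > max_len_ratio * len(complex_sent):
            continue  # "simple" side is suspiciously verbose
        kept.append((complex_sent, simple_sent))
    return kept

pairs = [
    ("The committee ratified the accord.", "The group approved the deal."),
    ("Same sentence.", "Same sentence."),
    ("Short.", "A very long and wordy so-called simplification of it."),
]
print(filter_pairs(pairs))  # only the first pair survives
```

In practice each noise type would get its own targeted filter, which is the approach the abstract describes.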
Citations: 0
A Benchmark for Neural Readability Assessment of Texts in Spanish
Laura Vásquez-Rodríguez, P. Cuenca-Jiménez, Sergio Morales-Esquivel, Fernando Alva-Manchego
We release a new benchmark for Automated Readability Assessment (ARA) of texts in Spanish. We combined existing corpora with suitable texts collected from the Web, thus creating the largest available dataset for ARA of Spanish texts. All data was pre-processed and categorised to allow experimenting with ARA models that make predictions at two (simple and complex) or three (basic, intermediate, and advanced) readability levels, and at two text granularities (paragraphs and sentences). An analysis based on readability indices shows that our proposed dataset groupings are suitable for their designated readability levels. We use our benchmark to train neural ARA models based on BERT in zero-shot, few-shot, and cross-lingual settings. Results show that either a monolingual or a multilingual pre-trained model can achieve good results when fine-tuned on language-specific data. In addition, all models decrease in performance when predicting three classes instead of two, showing opportunities for developing better ARA models for Spanish with existing resources.
Citations: 2
UoM&MMU at TSAR-2022 Shared Task: Prompt Learning for Lexical Simplification
Laura Vásquez-Rodríguez, Nhung T. H. Nguyen, M. Shardlow, S. Ananiadou
We present PromptLS, a method for fine-tuning large pre-trained Language Models (LMs) to perform the task of Lexical Simplification. We use a predefined template to obtain appropriate replacements for a term, and fine-tune an LM using this template on language-specific datasets. We filter candidate lists in post-processing to improve accuracy. We demonstrate that our model can work in a) a zero-shot setting (where we only require a pre-trained LM), b) a fine-tuned setting (where language-specific data is required), and c) a multilingual setting (where the model is pre-trained across multiple languages and fine-tuned in a specific language). Experimental results show that, although the zero-shot setting is competitive, its performance is still far from that of the fine-tuned setting. The multilingual model is, unsurprisingly, worse than the fine-tuned one. Among all TSAR-2022 Shared Task participants, our team ranked second in Spanish and third in English.
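The prompt-template idea — wrapping the complex word in a fill-in-the-blank pattern that a masked LM can complete — can be shown with a tiny sketch. The template wording below is invented for illustration; the paper's actual templates may differ:

```python
# Build a fill-mask prompt for a complex word in context. A masked LM
# (e.g. via a fill-mask pipeline) would then predict the [MASK] slot.

def build_prompt(sentence, complex_word, mask_token="[MASK]"):
    """Append an assumed 'simpler word for X is ___' template to the
    original sentence so the LM sees the word in context."""
    return f"{sentence} A simpler word for {complex_word} is {mask_token}."

prompt = build_prompt("The verdict was unequivocal.", "unequivocal")
print(prompt)
```

The mask token itself is model-specific (`[MASK]` for BERT-style models, `<mask>` for RoBERTa-style ones), which is why it is a parameter here.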
Citations: 7
Controlling Japanese Machine Translation Output by Using JLPT Vocabulary Levels
Alberto Poncelas, Ohnmar Htun
In Neural Machine Translation (NMT) systems, there is generally little control over the lexicon of the output. Consequently, the translated output may be too difficult for certain audiences. For example, for people with limited knowledge of the language, vocabulary is a major impediment to understanding a text. In this work, we build a complexity-controllable NMT system for English-to-Japanese translation. More specifically, we aim to modulate the difficulty of the translation in terms of not only the vocabulary but also the use of kanji. To achieve this, we follow a sentence-tagging approach to influence the output.
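Sentence tagging for complexity control typically means prepending a token for the desired difficulty level to each source sentence, so the NMT model learns to condition on it. A minimal sketch, where the `<N3>`-style token format is an assumption rather than the paper's exact scheme:

```python
# Prepend a JLPT-level control token to a source sentence before it is
# fed to the translation model. Token format "<N4>" etc. is assumed.

JLPT_LEVELS = {"N1", "N2", "N3", "N4", "N5"}

def tag_source(sentence, level):
    if level not in JLPT_LEVELS:
        raise ValueError(f"unknown JLPT level: {level}")
    return f"<{level}> {sentence}"

print(tag_source("The weather is nice today.", "N4"))
# -> "<N4> The weather is nice today."
```

At training time every source sentence would be tagged with the level of its reference translation; at inference time the user picks the tag to steer the output's difficulty.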
Citations: 0
PresiUniv at TSAR-2022 Shared Task: Generation and Ranking of Simplification Substitutes of Complex Words in Multiple Languages
Peniel Whistely, Sandeep Albert Mathias, Galiveeti Poornima
In this paper, we describe our approach to generating and ranking candidate contextual simplifications for a given complex word, using pre-trained language models (e.g., BERT), publicly available word embeddings (e.g., FastText), and a part-of-speech tagger. In this task, our system, PresiUniv, placed first in the Spanish track, 5th in the Brazilian-Portuguese track, and 10th in the English track. We upload our code and data for this project to aid in the replication of our results. We also analyze some of the errors and describe the design decisions we took while writing the paper.
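Ranking candidates by similarity to the complex word under static embeddings (FastText-style) reduces to a cosine-similarity sort. A toy sketch — the tiny 3-d vectors are fabricated for the example, not real FastText vectors:

```python
# Rank substitution candidates by cosine similarity of their word
# vectors to the target word's vector.

import numpy as np

def rank_by_similarity(target, candidates, vectors):
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(candidates,
                  key=lambda c: cos(vectors[target], vectors[c]),
                  reverse=True)

vectors = {
    "big":   np.array([0.9, 0.1, 0.0]),
    "large": np.array([0.8, 0.2, 0.1]),
    "huge":  np.array([0.7, 0.3, 0.2]),
    "tiny":  np.array([0.0, 0.9, 0.5]),
}
print(rank_by_similarity("big", ["large", "huge", "tiny"], vectors))
# -> ["large", "huge", "tiny"]
```

The system described in the abstract additionally uses a part-of-speech tagger to discard candidates whose POS differs from the complex word's, a step omitted from this sketch.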
Citations: 7
An Investigation into the Effect of Control Tokens on Text Simplification
Zihao Li, M. Shardlow, Saeed Hassan
Recent work on text simplification has focused on the use of control tokens to further the state of the art. However, it is not easy to further improve without an in-depth comprehension of the mechanisms underlying control tokens. One unexplored factor is the tokenisation strategy, which we also explore. In this paper, we (1) reimplemented ACCESS, (2) explored the effects of varying control tokens, (3) tested the influences of different tokenisation strategies, and (4) demonstrated how separate control tokens affect performance. We show variations of performance in the four control tokens separately. We also uncover how the design of control tokens could influence the performance and propose some suggestions for designing control tokens, which also reaches into other controllable text generation tasks.
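Control tokens of the kind ACCESS popularised encode, for each training pair, a property ratio between target and source (for example a character-length ratio) that is prepended to the input. A sketch of how one such token might be computed — the token name and rounding are assumptions, not the exact ACCESS implementation:

```python
# Compute an ACCESS-style character-length-ratio control token for a
# (source, target) training pair. Rounding to 2 decimals is assumed.

def nbchars_token(source, target):
    ratio = round(len(target) / len(source), 2)
    return f"<NbChars_{ratio:.2f}>"

src = "The magistrate adjourned the proceedings indefinitely."
tgt = "The judge stopped the trial."
print(nbchars_token(src, tgt), src)
```

At inference time the user sets the token value directly (e.g. a low ratio to request a shorter output), which is the controllability mechanism the paper probes, including how the tokeniser splits such tokens.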
Citations: 2
CENTAL at TSAR-2022 Shared Task: How Does Context Impact BERT-Generated Substitutions for Lexical Simplification?
Rodrigo Wilkens, David Alfter, Rémi Cardon, Isabelle Gribomont, Adrien Bibal, Watrin Patrick, Marie-Catherine de Marneffe, Thomas François
Lexical simplification is the task of substituting a difficult word with a simpler equivalent for a target audience. This is currently commonly done by modeling lexical complexity on a continuous scale to identify simpler alternatives to difficult words. In the TSAR shared task, the organizers call for systems capable of generating substitutions in a zero-shot-task context, for English, Spanish and Portuguese. In this paper, we present the solution we (the CENTAL team) proposed for the task. We explore the ability of BERT-like models to generate substitution words by masking the difficult word. To do so, we investigate various context enhancement strategies, which we combine into an ensemble method. We also explore different substitution ranking methods. We report on a post-submission analysis of the results and present our insights for potential improvements. The code for all our experiments is available at https://gitlab.com/Cental-FR/cental-tsar2022.
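One simple way to ensemble candidate lists produced under different context-enhancement strategies is to pool them and rank words by how many strategies proposed them. This is a toy illustration, not the ensembling scheme the paper actually uses:

```python
# Majority-vote ensembling of substitution-candidate lists: each
# strategy contributes at most one vote per word.

from collections import Counter

def ensemble(candidate_lists):
    votes = Counter()
    for cands in candidate_lists:
        votes.update(set(cands))  # one vote per strategy per word
    return [word for word, _ in votes.most_common()]

lists = [
    ["easy", "simple", "clear"],   # strategy A's candidates
    ["simple", "plain"],           # strategy B's candidates
    ["simple", "easy"],            # strategy C's candidates
]
print(ensemble(lists))  # "simple" first (3 votes), then "easy" (2 votes)
```

Ties among one-vote words come back in an arbitrary order here; a real system would break ties with a secondary score such as LM probability.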
Citations: 6
A Dataset of Word-Complexity Judgements from Deaf and Hard-of-Hearing Adults for Text Simplification
Oliver Alonzo, Sooyeon Lee, Mounica Maddela, Wei Xu, Matt Huenerfauth
Research has explored the use of automatic text simplification (ATS), which consists of techniques to make text simpler to read, to provide reading assistance to Deaf and Hard-of-Hearing (DHH) adults with various literacy levels. Prior work in this area has identified interest in and benefits from ATS-based reading assistance tools. However, no prior work on ATS has gathered judgements from DHH adults as to what constitutes complex text. Thus, following approaches in prior NLP work, this paper contributes new word-complexity judgements from 11 DHH adults on a dataset of 15,000 English words that had previously been annotated by L2 speakers, which we also augmented to include automatic annotations of the words' linguistic characteristics. Additionally, we conduct a supplementary analysis of the interaction effect between the linguistic characteristics of the words and the groups of annotators. This analysis highlights the importance of collecting judgements from DHH adults for training ATS systems, as it revealed statistically significant interaction effects for nearly all of the linguistic characteristics of the words.
Citations: 0