
Latest publications: Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

IrekiaLFes: a New Open Benchmark and Baseline Systems for Spanish Automatic Text Simplification
Itziar Gonzalez-Dios, Iker Gutiérrez-Fandiño, Oscar M. Cumbicus-Pineda, Aitor Soroa Etxabe
Automatic text simplification (ATS) seeks to reduce the complexity of a text for the general public or a target audience. In recent years, deep learning methods have become the most widely used systems in ATS research, but these systems need large, good-quality datasets for evaluation. Moreover, such data are available at scale only for English, and in some cases under restrictive licenses. In this paper, we present IrekiaLF_es, an open-license benchmark for Spanish text simplification. It consists of a document-level corpus and a manually aligned sentence-level test set. We also conduct a neurolinguistically based evaluation of the corpus in order to reveal its suitability for text simplification. This evaluation follows the Lexicon-Unification-Linearity (LeULi) model of neurolinguistic complexity assessment. Finally, we present a set of experiments and baselines of ATS systems in a zero-shot scenario.
Citations: 2
RCML at TSAR-2022 Shared Task: Lexical Simplification With Modular Substitution Candidate Ranking
Desislava Aleksandrova, Olivier Brochu Dufour
This paper describes the lexical simplification system RCML submitted to the English language track of the TSAR-2022 Shared Task. The system leverages a pre-trained language model to generate contextually plausible substitution candidates which are then ranked according to their simplicity as well as their grammatical and semantic similarity to the target complex word. Our submissions secure 6th and 7th places out of 33, improving over the SOTA baseline for 27 out of the 51 metrics.
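The candidate-ranking step described above can be illustrated with a toy sketch. This is not the RCML system itself: the scoring functions, weights, and example words below are invented, and the real system derives its scores from a pre-trained language model.

```python
# Toy sketch of substitution-candidate ranking (hypothetical scores and
# weights; the actual RCML features come from a pre-trained language model).

def rank_candidates(candidates, simplicity, similarity, w_simple=0.5, w_sim=0.5):
    """Rank substitution candidates by a weighted combination of a
    simplicity score and a contextual-similarity score (both in [0, 1])."""
    scored = [
        (w_simple * simplicity[c] + w_sim * similarity[c], c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, reverse=True)]

# Hypothetical scores for substitutes of the complex word "compulsory"
candidates = ["mandatory", "required", "obligatory"]
simplicity = {"mandatory": 0.4, "required": 0.9, "obligatory": 0.2}
similarity = {"mandatory": 0.9, "required": 0.8, "obligatory": 0.85}

ranking = rank_candidates(candidates, simplicity, similarity)
print(ranking)  # "required" ranks first: simple yet still close in meaning
```

The point of the modular design is that each score (simplicity, grammatical fit, semantic similarity) can be computed and weighted independently before the final ranking.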
Citations: 4
Improving Text Simplification with Factuality Error Detection
Yuan Ma, Sandaru Seneviratne, E. Daskalaki
In the past few years, the field of text simplification has been dominated by supervised learning approaches, thanks to the appearance of large parallel datasets such as Wikilarge and Newsela. However, these datasets contain sentence pairs with factuality errors that compromise model performance. We therefore propose a model-independent factuality error detection mechanism, covering bad simplification and bad alignment, to refine the Wikilarge dataset by reducing the weight of these samples during training. We demonstrate that this approach improves the performance of the state-of-the-art text simplification model TST5, with FKGL reductions of 0.33 and 0.29 on the TurkCorpus and ASSET test sets respectively. Our study illustrates the impact of erroneous samples in TS datasets and highlights the need for automatic methods to improve their quality.
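The reweighting idea can be sketched in a few lines. The detector below is a made-up heuristic (length and vocabulary overlap) standing in for the paper's actual mechanism, and the example pairs are invented; only the weighting scheme itself reflects the described approach.

```python
def reweight_samples(pairs, is_bad, down_weight=0.1):
    """Assign a training weight to each (complex, simple) pair; pairs
    flagged as bad simplification/alignment get a reduced weight."""
    return [down_weight if is_bad(src, tgt) else 1.0 for src, tgt in pairs]

def is_bad(src, tgt):
    # Hypothetical detector: flag pairs whose "simplification" is longer
    # than the source, or shares almost no vocabulary with it.
    src_tok, tgt_tok = src.lower().split(), tgt.lower().split()
    overlap = len(set(src_tok) & set(tgt_tok)) / max(len(set(src_tok)), 1)
    return len(tgt_tok) > len(src_tok) or overlap < 0.2

pairs = [
    ("The committee ratified the accord.", "The committee approved the deal."),
    ("He left.", "He departed from the premises at an unspecified hour."),
]
print(reweight_samples(pairs, is_bad))  # [1.0, 0.1]
```

During training, these weights would scale each pair's contribution to the loss, so suspect pairs still participate but no longer dominate.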
Citations: 1
CILS at TSAR-2022 Shared Task: Investigating the Applicability of Lexical Substitution Methods for Lexical Simplification
Sandaru Seneviratne, E. Daskalaki, H. Suominen
Lexical simplification — which aims to simplify complex text through the replacement of difficult words with simpler alternatives while maintaining the meaning of the given text — is popular as a way of improving text accessibility for both people and computers. First, lexical simplification through substitution can improve the understandability of complex text for readers such as non-native speakers, second language learners, and people with low literacy. Second, its usefulness has been demonstrated in many natural language processing problems like data augmentation, paraphrase generation, and word sense induction. In this paper, we investigated the applicability of existing unsupervised lexical substitution methods based on pre-trained contextual embedding models and WordNet, which incorporate Context Information, for Lexical Simplification (CILS). Although the performance of this CILS approach has been outstanding in lexical substitution tasks, its usefulness was limited at the TSAR-2022 shared task on lexical simplification. Consequently, a minimally supervised approach with careful tuning to a given simplification task may work better than unsupervised methods. Our investigation also encouraged further work on evaluating the simplicity of potential candidates and incorporating them into the lexical simplification methods.
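One way WordNet can be combined with a contextual model is as a filter over the model's suggestions. The sketch below uses a tiny hard-coded synonym table as a stand-in for WordNet, and the suggestion list is invented; it only illustrates the intersection idea, not the CILS system.

```python
# Toy stand-in for WordNet synsets (illustrative data only).
SYNONYMS = {
    "arduous": {"difficult", "hard", "tough", "laborious"},
}

def filter_candidates(word, model_suggestions):
    """Keep only the contextual model's suggestions that the lexical
    resource (here: a toy synonym table) also lists for the target word."""
    allowed = SYNONYMS.get(word, set())
    return [c for c in model_suggestions if c in allowed]

# Hypothetical masked-LM suggestions for "arduous" in some context
suggestions = ["hard", "long", "difficult", "strange"]
print(filter_candidates("arduous", suggestions))  # ['hard', 'difficult']
```

In a real system the table would be replaced by lookups over WordNet synsets, so the contextual model supplies fluency in context while the lexical resource guards meaning preservation.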
Citations: 4
Lexically Constrained Decoding with Edit Operation Prediction for Controllable Text Simplification
Tatsuya Zetsu, Tomoyuki Kajiwara, Yuki Arase
Controllable text simplification assists language learners by automatically rewriting complex sentences into simpler forms of a target level. However, existing methods tend to perform conservative edits that keep complex words intact. To address this problem, we employ lexically constrained decoding to encourage rewriting. Specifically, the proposed method predicts edit operations conditioned on a target level and creates positive/negative constraints for words that should or should not appear in the output sentence. The experimental results confirm that our method significantly outperforms previous methods, demonstrating new state-of-the-art performance.
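The effect of positive/negative lexical constraints on a single decoding step can be sketched with toy token scores. The score table and boost value below are invented; real constrained decoding operates over a model's logits within beam search rather than a greedy pick over a handful of words.

```python
import math

def constrained_step(scores, negative=frozenset(), positive=frozenset(), boost=2.0):
    """One decoding step under lexical constraints: tokens in the negative
    set are banned (-inf), tokens in the positive set get a score boost."""
    adjusted = {}
    for tok, s in scores.items():
        if tok in negative:
            adjusted[tok] = -math.inf   # never selectable
        elif tok in positive:
            adjusted[tok] = s + boost   # encouraged to appear
        else:
            adjusted[tok] = s
    return max(adjusted, key=adjusted.get)

# Hypothetical next-token scores when simplifying a sentence with "utilize"
scores = {"utilize": 3.1, "use": 2.7, "employ": 2.9}
choice = constrained_step(scores, negative={"utilize"}, positive={"use"})
print(choice)  # 'use': the complex word is banned and its simple form boosted
```

Without the constraints the decoder would keep the highest-scoring complex word, which is exactly the conservative behavior the paper aims to break.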
Citations: 2
GMU-WLV at TSAR-2022 Shared Task: Evaluating Lexical Simplification Models
Kai North, Alphaeus Dmonte, Tharindu Ranasinghe, Marcos Zampieri
This paper describes team GMU-WLV's submission to the TSAR shared task on multilingual lexical simplification. The goal of the task is to automatically provide a set of candidate substitutions for complex words in context. The organizers provided participants with ALEXSIS, a manually annotated dataset whose instances are split between a small trial set, with a dozen instances in each of the three competition languages (English, Portuguese, Spanish), and a test set with over 300 instances in each of those languages. To cope with the lack of training data, participants had to use either alternative data sources or pre-trained language models. We experimented with monolingual models: BERTimbau, ELECTRA, and RoBERTA-largeBNE. Our best system achieved 1st place out of sixteen systems for Portuguese, 8th out of thirty-three systems for English, and 6th out of twelve systems for Spanish.
Citations: 7
Target-Level Sentence Simplification as Controlled Paraphrasing
Tannon Kew, Sarah Ebling
Automatic text simplification aims to reduce the linguistic complexity of a text in order to make it easier to understand and more accessible. However, simplified texts are consumed by a diverse array of target audiences, and what is appropriately simplified for one group of readers may differ considerably for another. In this work we investigate a novel formulation of sentence simplification as paraphrasing with controlled decoding. This approach aims to alleviate the major burden of relying on large amounts of in-domain parallel training data, while at the same time allowing for modular and adaptive simplification. According to automatic metrics, our approach performs competitively against baselines that prove more difficult to adapt to the needs of different target audiences or that require significant amounts of complex-simple parallel aligned data.
Citations: 3
Patient-friendly Clinical Notes: Towards a new Text Simplification Dataset
Jan Trienes, Jörg Schlötterer, H. Schildhaus, C. Seifert
Automatic text simplification can help patients to better understand their own clinical notes. A major hurdle for the development of clinical text simplification methods is the lack of high quality resources. We report ongoing efforts in creating a parallel dataset of professionally simplified clinical notes. Currently, this corpus consists of 851 document-level simplifications of German pathology reports. We highlight characteristics of this dataset and establish first baselines for paragraph-level simplification.
Citations: 9
JADES: New Text Simplification Dataset in Japanese Targeted at Non-Native Speakers
Akio Hayakawa, Tomoyuki Kajiwara, Hiroki Ouchi, Taro Watanabe
The user-dependency of text simplification makes its evaluation obscure. A targeted evaluation dataset clarifies the purpose of simplification, though its specification is hard to define. We built JADES (JApanese Dataset for the Evaluation of Simplification), a text simplification dataset targeted at non-native Japanese speakers, according to public vocabulary and grammar profiles. JADES comprises 3,907 complex-simple sentence pairs annotated by an expert. Analysis of JADES shows that wide-ranging and multiple rewriting operations were applied during simplification. Furthermore, we analyzed outputs on JADES from several benchmark systems, along with their automatic and manual scores. The results of these analyses highlight differences between English and Japanese in operations and evaluations.
Citations: 0
teamPN at TSAR-2022 Shared Task: Lexical Simplification using Multi-Level and Modular Approach
Nikita Nikita, P. Rajpoot
Lexical simplification is the process of reducing the lexical complexity of a text by replacing difficult words with easier-to-read (or easier-to-understand) expressions while preserving the original information and meaning. This paper explains the work done by our team "teamPN" for the English track of the TSAR-2022 Shared Task on Lexical Simplification. We created a multi-level, modular pipeline that combines transformer-based models with traditional NLP methods like paraphrasing and verb sense disambiguation, treating the target text according to its semantics (part-of-speech tag). The pipeline is multi-level in that we utilize multiple source models to find potential candidates for replacement, and modular in that we can switch the source models and their weighting in the final re-ranking.
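Routing tokens to POS-specific modules, as the abstract describes, amounts to a dispatch table. The modules and replacement tables below are hypothetical placeholders; the actual teamPN components (paraphrasers, verb-sense disambiguation) are far richer.

```python
# Toy POS-routed simplification pipeline (placeholder modules and data;
# the real teamPN components are not reproduced here).

VERB_SENSES = {"comprehend": "understand"}
NOUN_PARAPHRASES = {"residence": "home"}
ADJ_PARAPHRASES = {"inexpensive": "cheap"}

def simplify_token(token, pos):
    """Route a token to a POS-specific simplification module;
    unhandled POS tags pass through unchanged."""
    modules = {
        "VERB": lambda t: VERB_SENSES.get(t, t),
        "NOUN": lambda t: NOUN_PARAPHRASES.get(t, t),
        "ADJ": lambda t: ADJ_PARAPHRASES.get(t, t),
    }
    handler = modules.get(pos, lambda t: t)
    return handler(token)

print(simplify_token("comprehend", "VERB"))  # 'understand'
print(simplify_token("residence", "NOUN"))   # 'home'
print(simplify_token("and", "CCONJ"))        # unchanged: 'and'
```

Because each POS category maps to its own module, individual components can be swapped or reweighted without touching the rest of the pipeline, which is the modularity the paper emphasizes.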
Citations: 2