Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021): latest articles

System Description for the CommonGen task with the POINTER model

Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

Pub Date : 1900-01-01 DOI: 10.18653/v1/2021.gem-1.15

Anna Shvets

In this experiment, we tested the constraint-based POINTER model on the CommonGen dataset for the structure-to-text task from the GEM living benchmark. POINTER is a hybrid architecture that combines the insertion-based and transformer paradigms, predicting the token and the insertion position at the same time. Given a set of keywords, the text is therefore generated gradually in a parallel, non-autoregressive manner. The pretrained model was fine-tuned on the training split of the CommonGen dataset, and the generated output was compared to the validation and challenge splits. The resulting metric outputs, which measure lexical equivalence, semantic similarity, and diversity, are discussed in detail in this system description.
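The generation scheme described in this abstract can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the POINTER implementation: the trained model is replaced by a hypothetical lookup table (`TOY_TABLE`), and the names `insertion_generate` and `toy_propose` are invented for illustration. At every round, each gap between neighbouring tokens receives either a token to insert or a "no insertion" decision, and all insertions of a round are applied at once.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for a trained model: maps a (left, right) neighbour
# pair to the token that should be inserted between them.
TOY_TABLE = {
    (None, "dog"): "the",
    ("dog", "frisbee"): "catches",
    ("catches", "frisbee"): "a",
}


def toy_propose(tokens: List[str], gap: int) -> Optional[str]:
    """Return the token to insert at `gap`, or None for "no insertion"."""
    left = tokens[gap - 1] if gap > 0 else None
    right = tokens[gap] if gap < len(tokens) else None
    return TOY_TABLE.get((left, right))


def insertion_generate(
    keywords: List[str],
    propose: Callable[[List[str], int], Optional[str]],
    max_rounds: int = 10,
) -> List[str]:
    """Grow a sentence from keywords by parallel insertion rounds."""
    tokens = list(keywords)
    for _ in range(max_rounds):
        # Gap i sits in front of tokens[i]; the last gap follows the final token.
        proposals = [propose(tokens, i) for i in range(len(tokens) + 1)]
        if all(p is None for p in proposals):
            break  # every gap chose "no insertion": generation is finished
        merged: List[str] = []
        for i, proposal in enumerate(proposals):
            if proposal is not None:
                merged.append(proposal)
            if i < len(tokens):
                merged.append(tokens[i])
        tokens = merged
    return tokens


print(" ".join(insertion_generate(["dog", "frisbee"], toy_propose)))
```

The keywords are kept in the output by construction, which is what makes the approach constraint-based; the number of rounds, rather than the number of tokens, bounds the number of sequential steps.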


Citations: 1

NUIG-DSI’s submission to The GEM Benchmark 2021

Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

Pub Date : 1900-01-01 DOI: 10.18653/v1/2021.gem-1.13

Nivranshu Pasricha, Mihael Arcan, P. Buitelaar

This paper describes the submission by NUIG-DSI to the GEM benchmark 2021. We participate in the modeling shared task where we submit outputs on four datasets for data-to-text generation, namely, DART, WebNLG (en), E2E and CommonGen. We follow an approach similar to the one described in the GEM benchmark paper where we use the pre-trained T5-base model for our submission. We train this model on additional monolingual data where we experiment with different masking strategies specifically focused on masking entities, predicates and concepts as well as a random masking strategy for pre-training. In our results we find that random masking performs the best in terms of automatic evaluation metrics, though the results are not statistically significantly different compared to other masking strategies.
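The difference between targeted and random masking can be sketched as follows. This is a hedged illustration, not the authors' pipeline: it assumes the T5 convention of replacing each masked span with a sentinel token (`<extra_id_0>`, `<extra_id_1>`, ...), works on whitespace tokens, and takes the set of entity tokens as given. The helper names `t5_mask`, `entity_positions`, and `random_positions` are hypothetical.

```python
import random
from typing import List, Set, Tuple


def t5_mask(tokens: List[str], masked: Set[int]) -> Tuple[str, str]:
    """Build a T5-style (input, target) pair from masked token positions.

    Adjacent masked tokens are merged into one span with one sentinel.
    """
    inp: List[str] = []
    tgt: List[str] = []
    sentinel = 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:
                marker = f"<extra_id_{sentinel}>"
                inp.append(marker)
                tgt.append(marker)
                sentinel += 1
            tgt.append(tok)
            prev_masked = True
        else:
            inp.append(tok)
            prev_masked = False
    tgt.append(f"<extra_id_{sentinel}>")  # closing sentinel ends the target
    return " ".join(inp), " ".join(tgt)


def entity_positions(tokens: List[str], entities: Set[str]) -> Set[int]:
    """Targeted strategy: mask every token that belongs to a known entity."""
    return {i for i, tok in enumerate(tokens) if tok in entities}


def random_positions(n_tokens: int, rate: float, seed: int = 0) -> Set[int]:
    """Random strategy: mask a fixed fraction of positions."""
    k = max(1, round(n_tokens * rate))
    return set(random.Random(seed).sample(range(n_tokens), k))


tokens = "Alan Bean was born in Wheeler , Texas".split()
print(t5_mask(tokens, entity_positions(tokens, {"Alan", "Bean", "Wheeler", "Texas"})))
print(t5_mask(tokens, random_positions(len(tokens), rate=0.25)))
```

Masking predicates or concepts would follow the same pattern, with a different rule for choosing the positions.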


Citations: 0

SimpleNER Sentence Simplification System for GEM 2021

Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

Pub Date : 1900-01-01 DOI: 10.18653/v1/2021.gem-1.14

KV Aditya Srivatsa, Monil Gokani, Manish Shrivastava

This paper describes SimpleNER, a model developed for the sentence simplification task at GEM-2021. Our system is a monolingual Seq2Seq Transformer architecture that uses control tokens prepended to the data, allowing the model to shape the generated simplifications according to user-desired attributes. Additionally, we show that NER-tagging the training data before use helps stabilize the effect of the control tokens and significantly improves the overall performance of the system. We also employ pretrained embeddings to reduce data sparsity and allow the model to produce more generalizable outputs.
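The idea of prepending control tokens can be sketched in a few lines. The abstract does not name the attributes or the token format, so everything specific here is an assumption for illustration: the single attribute is a character-length ratio, the token format `<LEN_RATIO_x>` is hypothetical, and so are the helper names. During training the ratio would be computed from the reference simplification; at inference the user picks it.

```python
def length_ratio(source: str, target: str) -> float:
    """Character-length ratio of the simplification to its source."""
    return len(target) / len(source)


def bucket(value: float, step: float = 0.05) -> float:
    """Snap a ratio to the nearest multiple of `step` to keep the vocabulary small."""
    return round(round(value / step) * step, 2)


def add_control_token(source: str, ratio: float) -> str:
    """Prepend a (hypothetical) length-ratio control token to the source."""
    return f"<LEN_RATIO_{bucket(ratio):.2f}> {source}"


# Training time: the ratio comes from the reference pair.
src = "The feline reposed upon the mat."
ref = "The cat sat on the mat."
print(add_control_token(src, length_ratio(src, ref)))

# Inference time: the user requests a shorter output directly.
print(add_control_token(src, 0.62))
```

Bucketing the ratio is a design choice that trades precision of control for a smaller set of tokens the model has to learn.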


Citations: 1

Human Perception in Natural Language Generation

Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

Pub Date : 1900-01-01 DOI: 10.18653/v1/2021.gem-1.2

Lorenzo De Mattei, Huiyuan Lai, F. Dell’Orletta, M. Nissim

We ask subjects whether they perceive a set of texts as human-produced; some of these texts are actually human-written, while others are automatically generated. We use this data to fine-tune a GPT-2 model to push it to generate more human-like texts, and observe that this fine-tuned model produces texts that are indeed perceived as more human-like than those of the original model. At the same time, we show that our automatic evaluation strategy correlates well with human judgements. We also run a linguistic analysis to unveil the characteristics of human- vs machine-perceived language.


Citations: 3

Semantic Similarity Based Evaluation for Abstractive News Summarization

Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

Pub Date : 1900-01-01 DOI: 10.18653/v1/2021.gem-1.3

Figen Beken Fikri, Kemal Oflazer, B. Yanikoglu

ROUGE is a widely used evaluation metric in text summarization. However, it is not suitable for the evaluation of abstractive summarization systems as it relies on lexical overlap between the gold standard and the generated summaries. This limitation becomes more apparent for agglutinative languages with very large vocabularies and high type/token ratios. In this paper, we present semantic similarity models for Turkish and apply them as evaluation metrics for an abstractive summarization task. To achieve this, we translated the English STSb dataset into Turkish and presented the first semantic textual similarity dataset for Turkish as well. We showed that our best similarity models have better alignment with average human judgments compared to ROUGE in both Pearson and Spearman correlations.
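The limitation of lexical overlap, and the way a metric is compared against human judgments, can be sketched in plain Python. This is a simplified illustration under stated assumptions: `rouge1_f1` is a bare unigram-overlap F1 without stemming or the other options of the official ROUGE toolkit, and the correlation helpers are textbook formulas that do not handle constant inputs. None of the data below comes from the paper.

```python
from collections import Counter
from math import sqrt
from typing import List, Sequence


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1: F1 over overlapping lowercase unigrams."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (undefined if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# A faithful paraphrase shares no words with the reference, so overlap is zero.
print(rouge1_f1("the film was great", "an excellent movie"))
```

Given per-summary metric scores and human ratings as two lists, `pearson` and `spearman` produce the two correlations used to compare metrics.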


Citations: 10