Self-Repetition in Abstractive Neural Summarizers

Proceedings of the conference. Association for Computational Linguistics. Meeting Pub Date : 2022-10-14 DOI:10.48550/arXiv.2210.08145

Nikita Salkar, T. Trikalinos, Byron C. Wallace, A. Nenkova

Pages: 341-350. URL: https://doi.org/10.48550/arXiv.2210.08145

Citations: 3

Abstract

We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5, and Pegasus), each fine-tuned on five datasets. In a regression analysis, we find that the three architectures differ in their propensity to repeat content across the summaries they produce for different inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In qualitative analysis, we find that systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus-level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
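The measure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `repeated_ngrams` and `self_repetition_rate`, the lowercasing, the whitespace tokenization, and the per-summary rate are all assumptions made for illustration. The sketch counts 4-grams only, on the reasoning that any repeated span longer than four tokens necessarily contains a repeated 4-gram.

```python
from collections import defaultdict


def repeated_ngrams(summaries, n=4):
    """Return the n-grams that occur in two or more summaries.

    Maps each such n-gram (a tuple of tokens) to the set of indices of
    the summaries containing it. Tokenization is lowercase whitespace
    splitting, which is an assumption of this sketch.
    """
    seen = defaultdict(set)
    for idx, text in enumerate(summaries):
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            seen[tuple(tokens[i:i + n])].add(idx)
    # Keep only n-grams shared by at least two different summaries.
    return {gram: ids for gram, ids in seen.items() if len(ids) > 1}


def self_repetition_rate(summaries, n=4):
    """Fraction of summaries sharing at least one n-gram with another one.

    This normalized rate is a hypothetical convenience, not a quantity
    defined in the abstract.
    """
    if not summaries:
        return 0.0
    flagged = set()
    for ids in repeated_ngrams(summaries, n).values():
        flagged |= ids
    return len(flagged) / len(summaries)


if __name__ == "__main__":
    outputs = [
        "the study was funded by the national institute of health",
        "results show the study was funded by a private donor",
        "no repeated content appears here at all",
    ]
    for gram, ids in repeated_ngrams(outputs).items():
        print(" ".join(gram), sorted(ids))
    print(self_repetition_rate(outputs))
```

Inspecting the shared n-grams themselves, rather than only the count, is what would surface artefacts such as boilerplate disclaimers in a system's outputs.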
