How do control tokens affect natural language generation tasks like text simplification

Natural Language Engineering | Pub Date: 2024-01-23 | DOI: 10.1017/s1351324923000566 | IF 2.3 | Zone 3 (Computer Science) | Q3 Computer Science, Artificial Intelligence

Zihao Li, Matthew Shardlow

Abstract

Recent work on text simplification has used control tokens to advance the state of the art. However, further improvement is difficult without an in-depth understanding of the mechanisms underlying control tokens. One previously unexplored factor is the tokenization strategy, which we also examine. In this paper, we (1) reimplemented AudienCe-CEntric Sentence Simplification (ACCESS), (2) explored the effects and interactions of varying control tokens, (3) tested the influence of different tokenization strategies, (4) demonstrated how individual control tokens affect performance, and (5) proposed new methods to predict the values of control tokens. We report how performance varies with each of the four control tokens separately. We also show how the design of control tokens can influence performance and offer suggestions for designing them. The newly proposed method achieves higher performance in both SARI (a common scoring metric in text simplification) and BERTScore (a score derived from the BERT language model) and shows potential for real applications.
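To make the idea of a control token concrete, the following is a minimal sketch of how such tokens might be computed from a training pair and prepended to the source sentence. It is not the authors' implementation: the token names `NbChars` and `LevSim`, the 0.05 bucket width, and all function names are assumptions for illustration, and the similarity score uses Python's `difflib` ratio as a stand-in for a true Levenshtein similarity. Only two of the four control tokens are shown.

```python
from difflib import SequenceMatcher


def bucket(value, step=0.05):
    """Round a ratio to the nearest multiple of `step` (assumed bucket width)."""
    return round(round(value / step) * step, 2)


def char_length_ratio(source, target):
    """Length of the simplified sentence relative to the source, in characters."""
    return len(target) / len(source)


def similarity_ratio(source, target):
    """Character-level similarity in [0, 1].

    Uses difflib's ratio as a stand-in; this is NOT Levenshtein similarity.
    """
    return SequenceMatcher(None, source, target).ratio()


def add_control_tokens(source, target):
    """Prepend illustrative control tokens to the source sentence."""
    tokens = [
        f"<NbChars_{bucket(char_length_ratio(source, target)):.2f}>",
        f"<LevSim_{bucket(similarity_ratio(source, target)):.2f}>",
    ]
    return " ".join(tokens + [source])


if __name__ == "__main__":
    src = "The committee reached a unanimous decision after lengthy deliberation."
    tgt = "The committee agreed after a long talk."
    print(add_control_tokens(src, tgt))
```

In a setup like this, the token values would be computed from reference pairs during training and chosen (or predicted) at inference time, which is the step the abstract's fifth contribution addresses.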

Source journal: Natural Language Engineering (Computer Science, Artificial Intelligence)

CiteScore: 5.90

Self-citation rate: 12.00%

Review time: >12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics, from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces, it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.