{"title":"用于高效线性时序建模的门控插槽注意力","authors":"Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu","doi":"arxiv-2409.07146","DOIUrl":null,"url":null,"abstract":"Linear attention Transformers and their gated variants, celebrated for\nenabling parallel training and efficient recurrent inference, still fall short\nin recall-intensive tasks compared to traditional Transformers and demand\nsignificant resources for training from scratch. This paper introduces Gated\nSlot Attention (GSA), which enhances Attention with Bounded-memory-Control\n(ABC) by incorporating a gating mechanism inspired by Gated Linear Attention\n(GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing\ncontext-aware memory reading and adaptive forgetting to improve memory capacity\nwhile maintaining compact recurrent state size. This design greatly enhances\nboth training and inference efficiency through GLA's hardware-efficient\ntraining algorithm and reduced state size. Additionally, retaining the softmax\noperation is particularly beneficial in \"finetuning pretrained Transformers to\nRNNs\" (T2R) settings, reducing the need for extensive training from scratch.\nExtensive experiments confirm GSA's superior performance in scenarios requiring\nin-context recall and in T2R settings.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"34 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gated Slot Attention for Efficient Linear-Time Sequence Modeling\",\"authors\":\"Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu\",\"doi\":\"arxiv-2409.07146\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Linear attention Transformers and their gated variants, celebrated for\\nenabling parallel training and efficient recurrent inference, still fall short\\nin recall-intensive tasks compared to traditional Transformers and demand\\nsignificant resources for training from scratch. This paper introduces Gated\\nSlot Attention (GSA), which enhances Attention with Bounded-memory-Control\\n(ABC) by incorporating a gating mechanism inspired by Gated Linear Attention\\n(GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing\\ncontext-aware memory reading and adaptive forgetting to improve memory capacity\\nwhile maintaining compact recurrent state size. This design greatly enhances\\nboth training and inference efficiency through GLA's hardware-efficient\\ntraining algorithm and reduced state size. 
Additionally, retaining the softmax\\noperation is particularly beneficial in \\\"finetuning pretrained Transformers to\\nRNNs\\\" (T2R) settings, reducing the need for extensive training from scratch.\\nExtensive experiments confirm GSA's superior performance in scenarios requiring\\nin-context recall and in T2R settings.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":\"34 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07146\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07146","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
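
The abstract describes the mechanism only at a high level (slot memories read via softmax, with gate-controlled adaptive forgetting). The sketch below illustrates what such a gated-slot recurrence could look like for a single head; the slot count m, the per-slot sigmoid gate alpha_t, the exact write rule, and the function name gsa_recurrent_step are illustrative assumptions, not the authors' reference implementation.

# Minimal sketch of a gated-slot recurrent step (assumptions: m memory slots,
# a per-slot forget gate alpha_t in (0, 1), and a softmax read over slots as
# in ABC-style bounded-memory attention). Illustrative only.
import torch
import torch.nn.functional as F

def gsa_recurrent_step(q_t, k_t, v_t, alpha_t, K_tilde, V_tilde):
    """One recurrent step of a gated-slot attention cell.

    q_t, k_t : (d_k,)   query / key for the current token
    v_t      : (d_v,)   value for the current token
    alpha_t  : (m,)     per-slot forget gate in (0, 1)
    K_tilde  : (m, d_k) slot key memory
    V_tilde  : (m, d_v) slot value memory
    """
    # Adaptive forgetting: decay each slot by its gate, then write the new
    # key/value with the complementary weight (bounded-memory control).
    K_tilde = alpha_t[:, None] * K_tilde + (1 - alpha_t)[:, None] * k_t[None, :]
    V_tilde = alpha_t[:, None] * V_tilde + (1 - alpha_t)[:, None] * v_t[None, :]

    # Context-aware memory reading: softmax over the m slots keeps the usual
    # attention normalization while the state stays a fixed (m x d) size.
    read_weights = F.softmax(K_tilde @ q_t, dim=-1)    # (m,)
    o_t = V_tilde.transpose(0, 1) @ read_weights       # (d_v,)
    return o_t, K_tilde, V_tilde

# Toy usage: d_k = d_v = 8 head dims, m = 4 slots, random inputs.
torch.manual_seed(0)
d_k, d_v, m = 8, 8, 4
K_tilde = torch.zeros(m, d_k)
V_tilde = torch.zeros(m, d_v)
for _ in range(5):  # iterate over a short token stream
    q_t, k_t, v_t = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha_t = torch.sigmoid(torch.randn(m))            # gate values in (0, 1)
    o_t, K_tilde, V_tilde = gsa_recurrent_step(q_t, k_t, v_t, alpha_t, K_tilde, V_tilde)
print(o_t.shape)  # torch.Size([8])

Note how the recurrent state is a fixed pair of (m x d) matrices regardless of sequence length, which is the compact recurrent state the abstract refers to, and how the softmax read over slots is the retained softmax operation that the abstract credits for making T2R finetuning effective.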