{"title":"用于高效线性时序建模的门控插槽注意力","authors":"Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu","doi":"arxiv-2409.07146","DOIUrl":null,"url":null,"abstract":"Linear attention Transformers and their gated variants, celebrated for\nenabling parallel training and efficient recurrent inference, still fall short\nin recall-intensive tasks compared to traditional Transformers and demand\nsignificant resources for training from scratch. This paper introduces Gated\nSlot Attention (GSA), which enhances Attention with Bounded-memory-Control\n(ABC) by incorporating a gating mechanism inspired by Gated Linear Attention\n(GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing\ncontext-aware memory reading and adaptive forgetting to improve memory capacity\nwhile maintaining compact recurrent state size. This design greatly enhances\nboth training and inference efficiency through GLA's hardware-efficient\ntraining algorithm and reduced state size. Additionally, retaining the softmax\noperation is particularly beneficial in \"finetuning pretrained Transformers to\nRNNs\" (T2R) settings, reducing the need for extensive training from scratch.\nExtensive experiments confirm GSA's superior performance in scenarios requiring\nin-context recall and in T2R settings.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"34 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gated Slot Attention for Efficient Linear-Time Sequence Modeling\",\"authors\":\"Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu\",\"doi\":\"arxiv-2409.07146\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Linear attention Transformers and their gated variants, celebrated for\\nenabling parallel training and efficient recurrent inference, still fall short\\nin recall-intensive tasks compared to traditional Transformers and demand\\nsignificant resources for training from scratch. This paper introduces Gated\\nSlot Attention (GSA), which enhances Attention with Bounded-memory-Control\\n(ABC) by incorporating a gating mechanism inspired by Gated Linear Attention\\n(GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing\\ncontext-aware memory reading and adaptive forgetting to improve memory capacity\\nwhile maintaining compact recurrent state size. This design greatly enhances\\nboth training and inference efficiency through GLA's hardware-efficient\\ntraining algorithm and reduced state size. 
Additionally, retaining the softmax\\noperation is particularly beneficial in \\\"finetuning pretrained Transformers to\\nRNNs\\\" (T2R) settings, reducing the need for extensive training from scratch.\\nExtensive experiments confirm GSA's superior performance in scenarios requiring\\nin-context recall and in T2R settings.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":\"34 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07146\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07146","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
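
The abstract describes the mechanism only at a high level (slot memories read via softmax, with gate-controlled adaptive forgetting). The sketch below illustrates what such a gated-slot recurrence could look like for a single head; the slot count m, the per-slot sigmoid gate alpha_t, the exact write rule, and the function name gsa_recurrent_step are illustrative assumptions, not the authors' reference implementation.

# Minimal sketch of a gated-slot recurrent step (assumptions: m memory slots,
# a per-slot forget gate alpha_t in (0, 1), and a softmax read over slots as
# in ABC-style bounded-memory attention). Illustrative only.
import torch
import torch.nn.functional as F

def gsa_recurrent_step(q_t, k_t, v_t, alpha_t, K_tilde, V_tilde):
    """One recurrent step of a gated-slot attention cell.

    q_t, k_t : (d_k,)   query / key for the current token
    v_t      : (d_v,)   value for the current token
    alpha_t  : (m,)     per-slot forget gate in (0, 1)
    K_tilde  : (m, d_k) slot key memory
    V_tilde  : (m, d_v) slot value memory
    """
    # Adaptive forgetting: decay each slot by its gate, then write the new
    # key/value with the complementary weight (bounded-memory control).
    K_tilde = alpha_t[:, None] * K_tilde + (1 - alpha_t)[:, None] * k_t[None, :]
    V_tilde = alpha_t[:, None] * V_tilde + (1 - alpha_t)[:, None] * v_t[None, :]

    # Context-aware memory reading: softmax over the m slots keeps the usual
    # attention normalization while the state stays a fixed (m x d) size.
    read_weights = F.softmax(K_tilde @ q_t, dim=-1)    # (m,)
    o_t = V_tilde.transpose(0, 1) @ read_weights       # (d_v,)
    return o_t, K_tilde, V_tilde

# Toy usage: d_k = d_v = 8 head dims, m = 4 slots, random inputs.
torch.manual_seed(0)
d_k, d_v, m = 8, 8, 4
K_tilde = torch.zeros(m, d_k)
V_tilde = torch.zeros(m, d_v)
for _ in range(5):  # iterate over a short token stream
    q_t, k_t, v_t = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha_t = torch.sigmoid(torch.randn(m))            # gate values in (0, 1)
    o_t, K_tilde, V_tilde = gsa_recurrent_step(q_t, k_t, v_t, alpha_t, K_tilde, V_tilde)
print(o_t.shape)  # torch.Size([8])

Note how the recurrent state is a fixed pair of (m x d) matrices regardless of sequence length, which is the compact recurrent state the abstract refers to, and how the softmax read over slots is the retained softmax operation that the abstract credits for making T2R finetuning effective.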