Real block-circulant matrices and DCT-DST algorithm for transformer neural network

Frontiers in Applied Mathematics and Statistics · IF 1.3 · Q3 (Mathematics, Interdisciplinary Applications) · Pub Date: 2023-12-12 · DOI: 10.3389/fams.2023.1260187
Euis Asriani, I. Muchtadi-Alamsyah, Ayu Purwarianti
{"title":"Real block-circulant matrices and DCT-DST algorithm for transformer neural network","authors":"Euis Asriani, I. Muchtadi-Alamsyah, Ayu Purwarianti","doi":"10.3389/fams.2023.1260187","DOIUrl":null,"url":null,"abstract":"In the encoding and decoding process of transformer neural networks, a weight matrix-vector multiplication occurs in each multihead attention and feed forward sublayer. Assigning the appropriate weight matrix and algorithm can improve transformer performance, especially for machine translation tasks. In this study, we investigate the use of the real block-circulant matrices and an alternative to the commonly used fast Fourier transform (FFT) algorithm, namely, the discrete cosine transform–discrete sine transform (DCT-DST) algorithm, to be implemented in a transformer. We explore three transformer models that combine the use of real block-circulant matrices with different algorithms. We start from generating two orthogonal matrices, U and Q. The matrix U is spanned by the combination of the reals and imaginary parts of eigenvectors of the real block-circulant matrix, whereas Q is defined such that the matrix multiplication QU can be represented in the shape of a DCT-DST matrix. The final step is defining the Schur form of the real block-circulant matrix. We find that the matrix-vector multiplication using the DCT-DST algorithm can be defined by assigning the Kronecker product between the DCT-DST matrix and an orthogonal matrix in the same order as the dimension of the circulant matrix that spanned the real block circulant. According to the experiment's findings, the dense-real block circulant DCT-DST model with largest matrix dimension was able to reduce the number of model parameters up to 41%. The same model of 128 matrix dimension gained 26.47 of BLEU score, higher compared to the other two models on the same matrix dimensions.","PeriodicalId":36662,"journal":{"name":"Frontiers in Applied Mathematics and Statistics","volume":"52 14","pages":""},"PeriodicalIF":1.3000,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Applied Mathematics and Statistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fams.2023.1260187","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

In the encoding and decoding process of transformer neural networks, a weight matrix-vector multiplication occurs in each multihead attention and feed-forward sublayer. Choosing an appropriate weight matrix and algorithm can improve transformer performance, especially on machine translation tasks. In this study, we investigate the use of real block-circulant matrices, together with an alternative to the commonly used fast Fourier transform (FFT) algorithm, namely the discrete cosine transform-discrete sine transform (DCT-DST) algorithm, for implementation in a transformer. We explore three transformer models that combine real block-circulant matrices with different algorithms. We start by generating two orthogonal matrices, U and Q. The matrix U is spanned by the real and imaginary parts of the eigenvectors of the real block-circulant matrix, whereas Q is defined such that the matrix product QU can be represented as a DCT-DST matrix. The final step is defining the Schur form of the real block-circulant matrix. We find that the matrix-vector multiplication using the DCT-DST algorithm can be defined through the Kronecker product of the DCT-DST matrix and an orthogonal matrix whose order equals the dimension of the circulant matrix that spans the real block-circulant matrix. According to the experimental findings, the dense-real block-circulant DCT-DST model with the largest matrix dimension reduced the number of model parameters by up to 41%. At a matrix dimension of 128, the same model achieved a BLEU score of 26.47, higher than the other two models at the same matrix dimension.
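To make the construction concrete, here is a minimal NumPy sketch of the primitive the abstract is built around: a weight matrix assembled from circulant blocks, multiplied against a vector through a fast transform. The sketch uses the standard FFT route, i.e., the baseline that the paper's DCT-DST algorithm replaces; the function names, array shapes, and block layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): a block-circulant weight
# matrix-vector product computed with the FFT, the baseline transform
# that the paper replaces with a DCT-DST algorithm. Shapes and names
# are illustrative assumptions.
import numpy as np

def circulant(c):
    """Dense circulant matrix with first column c: C[i, j] = c[(i - j) % n]."""
    n = len(c)
    return c[(np.arange(n)[:, None] - np.arange(n)[None, :]) % n]

def block_circulant_matvec(blocks, x):
    """y = W x, where W is a b x b grid of n x n circulant blocks.

    blocks: array of shape (b, b, n); blocks[i, j] is the defining
            (first) column of circulant block C_ij, so W stores b*b*n
            parameters instead of (b*n)**2 for a dense matrix.
    x:      vector of length b * n.
    """
    b, _, n = blocks.shape
    X = np.fft.fft(x.reshape(b, n), axis=1)   # spectrum of each input segment
    B = np.fft.fft(blocks, axis=2)            # spectrum of each block's column
    Y = np.einsum("ijk,jk->ik", B, X)         # Y_i = sum_j fft(c_ij) * fft(x_j)
    return np.fft.ifft(Y, axis=1).real.ravel()

rng = np.random.default_rng(0)
b, n = 4, 32                                  # 4 x 4 grid of 32 x 32 blocks
blocks = rng.standard_normal((b, b, n))
x = rng.standard_normal(b * n)

# Dense reference: assemble the full (b*n) x (b*n) matrix block by block.
W = np.block([[circulant(blocks[i, j]) for j in range(b)] for i in range(b)])
assert np.allclose(W @ x, block_circulant_matvec(blocks, x))
print("parameters:", blocks.size, "vs dense:", W.size)   # b*b*n vs (b*n)**2
```

The parameter saving is visible in the final line: a dense (b·n) × (b·n) weight stores (b·n)² values, whereas the block-circulant version stores only b²·n, one defining vector per block; the up-to-41% model-level reduction the authors report comes from applying such weights in the attention and feed-forward sublayers. The paper's contribution replaces the complex-valued FFT above with real arithmetic, expressing the product through the Kronecker product of a DCT-DST matrix and an orthogonal matrix whose order equals the circulant block dimension.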
Source journal
Frontiers in Applied Mathematics and Statistics (Mathematics: Statistics and Probability)
CiteScore: 1.90 · Self-citation rate: 7.10% · Articles published: 117 · Review time: 14 weeks
Latest articles in this journal
- Third-degree B-spline collocation method for singularly perturbed time delay parabolic problem with two parameters
- Item response theory to discriminate COVID-19 knowledge and attitudes among university students
- Editorial: Justified modeling frameworks and novel interpretations of ecological and epidemiological systems
- Pneumonia and COVID-19 co-infection modeling with optimal control analysis
- Enhanced corn seed disease classification: leveraging MobileNetV2 with feature augmentation and transfer learning