Kolmogorov-Arnold Transformer

arXiv - CS - Neural and Evolutionary Computing | Pub Date: 2024-09-16 | DOI: arxiv-2409.10594

Xingyi Yang, Xinchao Wang


Citations: 0

Abstract

Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and computation inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely large. (C3) Weight initialization. Initializing the weights in KANs is particularly challenging because of their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome these challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights across a group of neurons to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights so that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.
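To make (S1) and (S2) concrete, the following is a minimal pure-Python sketch of a group-wise rational activation. It is an illustration under stated assumptions, not the authors' CUDA implementation: the function names `rational` and `group_rational`, the coefficient layout, and the "safe" denominator form Q(x) = 1 + |b1*x + ... + bn*x^n| are all choices made here for the example and are not specified in the abstract.

```python
from typing import List, Sequence, Tuple

# Hypothetical coefficient layout: (numerator coeffs a0..am, denominator coeffs b1..bn)
Coeffs = Tuple[Sequence[float], Sequence[float]]


def rational(x: float, a: Sequence[float], b: Sequence[float]) -> float:
    """Evaluate P(x) / Q(x) for one scalar input.

    P(x) = a0 + a1*x + ... + am*x^m
    Q(x) = 1 + |b1*x + ... + bn*x^n|   (assumed form; keeps Q(x) >= 1, so no poles)
    """
    # Horner's rule for the numerator.
    p = 0.0
    for coef in reversed(a):
        p = p * x + coef
    # Horner's rule for b1 + b2*x + ... + bn*x^(n-1), then multiply by x
    # because the denominator polynomial has no constant term.
    q = 0.0
    for coef in reversed(b):
        q = q * x + coef
    q *= x
    return p / (1.0 + abs(q))


def group_rational(x: Sequence[float], coeffs: Sequence[Coeffs],
                   num_groups: int) -> List[float]:
    """Apply a rational activation per channel, sharing coefficients per group.

    Channels are split into `num_groups` contiguous groups; every channel in a
    group uses the same (a, b) coefficients, so only `num_groups` coefficient
    sets are learned rather than one function per input-output pair.
    """
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    if len(coeffs) != num_groups:
        raise ValueError("need exactly one coefficient set per group")
    group_size = channels // num_groups
    out = []
    for c, value in enumerate(x):
        a, b = coeffs[c // group_size]  # look up the group's shared coefficients
        out.append(rational(value, a, b))
    return out


# Example: 4 channels in 2 groups.
# Group 0 uses the identity x / 1; group 1 uses x / (1 + |x|).
shared = [([0.0, 1.0], []), ([0.0, 1.0], [1.0])]
result = group_rational([2.0, -3.0, 1.0, -1.0], shared, num_groups=2)
# → [2.0, -3.0, 0.5, -0.5]
```

A rational function needs only multiplications, additions, and one division per element, which is the kind of regular arithmetic that maps well to GPU kernels. The variance-preserving initialization of (S3) is not shown here, since the abstract does not give its details.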


