Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention

arXiv Pub Date: 2024-02-15 | DOI: 10.48550/arXiv.2402.10198
Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, I. Redko
{"title":"Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention","authors":"Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, I. Redko","doi":"10.48550/arXiv.2402.10198","DOIUrl":null,"url":null,"abstract":"Transformer-based architectures achieved breakthrough performance in natural language processing and computer vision, yet they remain inferior to simpler linear baselines in multivariate long-term forecasting. To better understand this phenomenon, we start by studying a toy linear forecasting problem for which we show that transformers are incapable of converging to their true solution despite their high expressive power. We further identify the attention of transformers as being responsible for this low generalization capacity. Building upon this insight, we propose a shallow lightweight transformer model that successfully escapes bad local minima when optimized with sharpness-aware optimization. We empirically demonstrate that this result extends to all commonly used real-world multivariate time series datasets. In particular, SAMformer surpasses the current state-of-the-art model TSMixer by 14.33% on average, while having ~4 times fewer parameters. The code is available at https://github.com/romilbert/samformer.","PeriodicalId":8425,"journal":{"name":"ArXiv","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ArXiv","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2402.10198","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Transformer-based architectures achieved breakthrough performance in natural language processing and computer vision, yet they remain inferior to simpler linear baselines in multivariate long-term forecasting. To better understand this phenomenon, we start by studying a toy linear forecasting problem for which we show that transformers are incapable of converging to their true solution despite their high expressive power. We further identify the attention of transformers as being responsible for this low generalization capacity. Building upon this insight, we propose a shallow lightweight transformer model that successfully escapes bad local minima when optimized with sharpness-aware optimization. We empirically demonstrate that this result extends to all commonly used real-world multivariate time series datasets. In particular, SAMformer surpasses the current state-of-the-art model TSMixer by 14.33% on average, while having ~4 times fewer parameters. The code is available at https://github.com/romilbert/samformer.
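To make the two ideas in the title concrete, here is a minimal PyTorch sketch, not the authors' implementation (see the repository linked above for that): a channel-wise attention layer whose D x D attention matrix mixes channels rather than time steps, and a generic two-step sharpness-aware minimization (SAM) update. The class and function names, the d_model and rho values, and the smoke-test shapes are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelWiseAttention(nn.Module):
    """Self-attention over the D channels of a multivariate series:
    the attention matrix is D x D instead of the usual L x L over time."""

    def __init__(self, seq_len: int, d_model: int = 16):
        super().__init__()
        self.q = nn.Linear(seq_len, d_model)  # queries from each channel's full history
        self.k = nn.Linear(seq_len, d_model)
        self.v = nn.Linear(seq_len, seq_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, D, L) -- channels first, L past time steps per channel
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5)  # (batch, D, D)
        return F.softmax(scores, dim=-1) @ v                   # (batch, D, L)


def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One generic two-step SAM update: ascend to a nearby high-loss
    point in weight space, then descend with the gradient found there."""
    # First pass: gradient at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))

    # Ascent: perturb each weight by rho * grad / ||grad||.
    with torch.no_grad():
        eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
        for p, e in zip(params, eps):
            p.add_(e)
    optimizer.zero_grad()

    # Second pass: gradient evaluated at the perturbed weights.
    loss_fn(model(x), y).backward()

    # Restore the original weights, then step with the SAM gradient.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()


if __name__ == "__main__":
    # Smoke test with hypothetical sizes: batch 32, D=7 channels, L=96 steps.
    model = ChannelWiseAttention(seq_len=96)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(32, 7, 96)
    print(sam_step(model, F.mse_loss, x, x, opt))
```

SAMformer itself stays shallow, roughly one such attention block with a residual connection feeding a single linear forecasting layer, which is consistent with the ~4x parameter reduction over TSMixer claimed above; the sketch omits that wiring.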