Optimal RoPE extension via Bayesian Optimization for training-free length generalization

Xinrong Zhang, Shengding Hu, Weilin Zhao, Huadong Wang, Xu Han, Chaoqun He, Guoyang Zeng, Zhiyuan Liu, Maosong Sun

AI Open, Volume 6 (2025), Pages 1-11. Published 2025-01-01. DOI: 10.1016/j.aiopen.2025.01.002. Available at: https://www.sciencedirect.com/science/article/pii/S2666651025000026
Citations: 0
Abstract
Transformers are designed to process inputs of variable length, constrained only by computational resources. In practice, however, their performance deteriorates sharply once the input exceeds a threshold only slightly larger than the pre-training context window. This limit on the effective context window constrains the application of Transformer-based large language models (LLMs), which have been the subject of great anticipation. Enabling pre-trained LLMs to generalize to longer inputs is therefore a pivotal and formidable challenge. Previous work has approached this challenge by modifying the Rotary Position Embedding (RoPE), the primary factor behind the disparity in handling different input lengths. These efforts have yielded valuable insights, but they often lack a deep understanding of the root causes of the performance degradation and rely heavily on manual parameter tuning. In response, we conduct a comprehensive analysis and identify two primary causes of the performance drop: global distribution mismatch and local resolution degradation. Building on this analysis, we introduce an Optimal RoPE (ORoPE) extension based on Bayesian Optimization (BO), which eliminates the need for additional model training. Our experiments demonstrate the efficacy of the approach, which outperforms baselines by up to 21.9%, 32.1%, and 41.2% at evaluation lengths of 8K, 16K, and 32K, respectively. We will release all code and data upon publication of this paper.
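The abstract describes the recipe only at a high level: rescale RoPE's rotary frequencies and choose the rescaling parameters with Bayesian Optimization instead of manual tuning. The sketch below is a minimal, hedged illustration of that idea, not the paper's implementation: the RoPE angle construction follows the standard formulation, but the one-dimensional search space (the rotary base), the scikit-optimize BO backend, and the toy objective (a crude stand-in for the "global distribution mismatch" the authors identify, rather than long-context perplexity measured with a frozen LLM) are all assumptions introduced here for illustration.

```python
# Minimal sketch: tuning a RoPE extension parameter with Bayesian Optimization.
# Assumptions (not from the paper): a single search dimension (the rotary base),
# scikit-optimize's gp_minimize as the BO backend, and a toy objective that
# stands in for long-context evaluation with a frozen LLM.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

HEAD_DIM = 128      # per-head dimension of the attention layer (assumed)
TRAIN_CTX = 4096    # pre-training context window (assumed)
EVAL_CTX = 32768    # target evaluation length

def rope_angles(seq_len: int, head_dim: int, base: float) -> np.ndarray:
    """Rotary angles m * theta_i with theta_i = base^(-2i/d), shape (seq_len, d/2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    positions = np.arange(seq_len)[:, None]
    return positions * inv_freq[None, :]

def objective(params) -> float:
    """Toy proxy for the quantity BO would minimize.

    Measures how far the rotary-angle distribution at extended positions drifts
    from the distribution seen inside the training window; a real run would
    instead query a frozen LLM on held-out long text (e.g. perplexity at 32K).
    """
    (base,) = params
    angles = rope_angles(EVAL_CTX, HEAD_DIM, base) % (2 * np.pi)
    in_window = angles[:TRAIN_CTX].ravel()
    beyond = angles[TRAIN_CTX:].ravel()
    hist_in, edges = np.histogram(in_window, bins=64, range=(0, 2 * np.pi), density=True)
    hist_out, _ = np.histogram(beyond, bins=edges, density=True)
    return float(np.abs(hist_in - hist_out).sum())

if __name__ == "__main__":
    result = gp_minimize(
        objective,
        [Real(1e4, 1e7, prior="log-uniform", name="rope_base")],  # search over the rotary base
        n_calls=20,
        random_state=0,
    )
    print(f"best base: {result.x[0]:.1f}, mismatch score: {result.fun:.4f}")
```

The log-uniform prior over the base reflects that frequency rescaling acts multiplicatively; in the paper's training-free setting, each BO trial would only require forward passes through the unmodified model with the candidate RoPE parameters, never gradient updates.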