On Elastic Language Models

arXiv (Cornell University). Pub Date: 2023-11-13. DOI: 10.48550/arxiv.2311.07204

Zhang, Chen; Wang, Benyou; Song, Dawei



Abstract

Large-scale pretrained language models have achieved compelling performance across a wide range of language understanding and information retrieval tasks. Knowledge distillation makes it possible to compress a large language model into a small one and thereby reach a reasonable latency-performance tradeoff. However, in scenarios where the number of requests (e.g., queries submitted to a search engine) is highly variable, the static tradeoff attained by the compressed language model might not always fit. Once a model is assigned a static tradeoff, it can be inadequate in two ways: the latency is too high when the number of requests is large, or the performance is too low when the number of requests is small. To this end, we propose an elastic language model (ElasticLM) that elastically adjusts the tradeoff according to the request stream. The basic idea is to introduce compute elasticity into the compressed language model, so that the tradeoff can vary on the fly along scalable and controllable compute. Specifically, we impose an elastic structure to endow ElasticLM with compute elasticity and design an elastic optimization to learn ElasticLM under compute elasticity. To serve ElasticLM, we apply an elastic schedule. Considering the specificity of information retrieval, we adapt ElasticLM to dense retrieval and reranking, and present ElasticDenser and ElasticRanker, respectively. Offline evaluation is conducted on the language understanding benchmark GLUE and on several information retrieval tasks, including Natural Questions, TriviaQA, and MS MARCO. The results show that ElasticLM, along with ElasticDenser and ElasticRanker, performs correctly and competitively compared with an array of static baselines. Furthermore, an online simulation with concurrency is carried out. The results demonstrate that ElasticLM can provide elastic tradeoffs with respect to a varying request stream.
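The abstract does not spell out how the elastic schedule works, so the following is only an illustrative sketch of the general idea of adjusting the tradeoff to the request stream, not the authors' method. It assumes a hypothetical set of compute levels with made-up latency and quality numbers, and a hypothetical rule that picks the highest-quality level whose estimated time to drain the pending requests fits a latency budget. The names `ComputeLevel`, `LEVELS`, and `pick_level` are inventions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeLevel:
    """One hypothetical point on the latency-performance tradeoff."""
    name: str
    latency_ms: float  # assumed per-request latency at this level
    quality: float     # assumed offline score at this level


# Hypothetical levels, ordered from most to least compute.
# The numbers are placeholders, not results from the paper.
LEVELS = [
    ComputeLevel("full", 40.0, 0.90),
    ComputeLevel("medium", 20.0, 0.87),
    ComputeLevel("small", 10.0, 0.83),
]


def pick_level(pending_requests, latency_budget_ms, levels=LEVELS):
    """Pick the first (highest-compute) level that fits the budget.

    The drain time is estimated as pending_requests * latency_ms,
    which assumes sequential processing (an assumption of this sketch).
    Falls back to the cheapest level when nothing fits.
    """
    queue = max(pending_requests, 1)  # an idle server still serves one request
    for level in levels:
        if queue * level.latency_ms <= latency_budget_ms:
            return level
    return levels[-1]


if __name__ == "__main__":
    for pending in (1, 4, 8, 100):
        level = pick_level(pending, latency_budget_ms=100.0)
        print(pending, level.name)
```

Under these assumptions, a light request stream is served at the full compute level, and the choice shifts to cheaper levels as the number of pending requests grows, which mirrors the abstract's point that a single static tradeoff is too slow under heavy load and too weak under light load.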
