Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks

IEEE Transactions on Wireless Communications, vol. 24, no. 1, pp. 643-658. Pub Date: 2024-11-20. DOI: 10.1109/TWC.2024.3497923
IF 10.7; Zone 1 (Computer Science); JCR Q1 (Engineering, Electrical & Electronic)
URL: https://ieeexplore.ieee.org/document/10759588/
Xinyuan Zhang; Jiangtian Nie; Yudong Huang; Gaochang Xie; Zehui Xiong; Jiang Liu; Dusit Niyato; Xuemin Shen
Citations: 0

Abstract

Generative Artificial Intelligence (GAI) is revolutionizing the world with its unprecedented content creation ability, and the Large Language Model (LLM) is one of its most widely embraced branches. However, because of their substantial size and resource-intensive nature, LLMs are cloud-hosted, which raises concerns about privacy, usage limitations, and latency. In this paper, we propose to utilize ubiquitous distributed wireless edge computing resources for real-time LLM inference. Specifically, we introduce a novel LLM edge inference framework that incorporates batching and model quantization to ensure high-throughput inference on resource-limited edge devices. Then, based on the architecture of transformer decoder-based LLMs, we formulate an NP-hard edge inference optimization problem that considers batch scheduling and the joint allocation of communication and computation resources. Its solution is the optimal throughput under edge resource constraints and heterogeneous user requirements on latency and accuracy. To solve this NP-hard problem, we develop an OT-GAH (Optimal Tree-search with Generalized Assignment Heuristics) algorithm with reasonable complexity and a $\frac{1}{2}$-approximation ratio. We first design the OT algorithm with online tree-pruning for the single-edge-node multi-user case, which navigates the inference request selection within the tree structure to maximize throughput. We then consider the multi-edge-node case and propose the GAH algorithm, which recursively invokes the OT in each node’s inference scheduling iteration. Simulation results demonstrate the superiority of OT-GAH batching over other benchmarks, revealing a time complexity reduction of over 45% compared to brute-force searching.
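The abstract mentions selecting inference requests through a tree search with pruning on a single edge node. The sketch below is only a toy illustration of that general idea, not the paper's OT algorithm: the Request fields, the linear batch_latency model, the memory budget, and the function names are all assumptions made up for this example. It explores an include/exclude tree over the requests and prunes a branch when it is infeasible or cannot beat the best batch found so far.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    name: str
    memory_gb: int     # hypothetical memory footprint of the request in a batch
    deadline_ms: int   # hypothetical latency requirement of the user


def batch_latency(batch_size: int) -> int:
    """Hypothetical latency model: fixed cost plus a per-request cost (ms)."""
    return 500 + 200 * batch_size


def feasible(chosen: List[Request], memory_budget_gb: int) -> bool:
    """A batch is feasible if it fits in memory and meets every deadline."""
    if sum(r.memory_gb for r in chosen) > memory_budget_gb:
        return False
    latency = batch_latency(len(chosen))
    return all(latency <= r.deadline_ms for r in chosen)


def best_batch(requests: List[Request], memory_budget_gb: int) -> List[Request]:
    """Return a largest feasible batch via include/exclude tree search."""
    best: List[Request] = []

    def search(i: int, chosen: List[Request]) -> None:
        nonlocal best
        # Bound: even taking every remaining request cannot beat the incumbent.
        if len(chosen) + (len(requests) - i) <= len(best):
            return
        if i == len(requests):
            best = list(chosen)  # strictly larger than the incumbent here
            return
        # Include branch. Under this toy model, adding requests never makes
        # an infeasible batch feasible, so infeasible branches can be cut.
        candidate = chosen + [requests[i]]
        if feasible(candidate, memory_budget_gb):
            search(i + 1, candidate)
        # Exclude branch.
        search(i + 1, chosen)

    search(0, [])
    return best


if __name__ == "__main__":
    demo = [
        Request("A", 4, 1000),
        Request("B", 3, 800),
        Request("C", 2, 1200),
        Request("D", 5, 2000),
    ]
    print([r.name for r in best_batch(demo, memory_budget_gb=10)])
```

Throughput is approximated here by the number of requests served in one batch; a real formulation would also account for quantization levels, accuracy requirements, and communication resources, which this sketch leaves out.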
Source journal
CiteScore: 18.60
Self-citation rate: 10.60%
Articles published: 708
Review time: 5.6 months
Journal description: The IEEE Transactions on Wireless Communications is a prestigious publication that showcases cutting-edge advancements in wireless communications. It welcomes both theoretical and practical contributions in various areas. The scope of the Transactions encompasses a wide range of topics, including modulation and coding, detection and estimation, propagation and channel characterization, and diversity techniques. The journal also emphasizes the physical and link layer communication aspects of network architectures and protocols. The journal is open to papers on specific topics or non-traditional topics related to specific application areas. This includes simulation tools and methodologies, orthogonal frequency division multiplexing, MIMO systems, and wireless over optical technologies. Overall, the IEEE Transactions on Wireless Communications serves as a platform for high-quality manuscripts that push the boundaries of wireless communications and contribute to advancements in the field.
Latest articles from this journal
Performance Analysis and Optimization Design of Uplink RSMA-Enabled Cell-Free Massive MIMO Systems with Hardware Impairments
Energy-Efficient Federated Edge Learning For Small-Scale Datasets in Large IoT Networks
Rotatable Antenna Enabled Spectrum Sharing: Joint Antenna Orientation and Beamforming Design
Matched Filtering-Based Channel Estimation for AFDM Systems in Doubly Selective Channels
Rotatable Antenna Enabled Multi-Cell Mixed Near-Field and Far-Field Communications