Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks

IEEE Transactions on Wireless Communications (IF 10.7; Zone 1, Computer Science; JCR Q1, Engineering, Electrical & Electronic). Pub Date: 2024-11-20. DOI: 10.1109/TWC.2024.3497923

Xinyuan Zhang;Jiangtian Nie;Yudong Huang;Gaochang Xie;Zehui Xiong;Jiang Liu;Dusit Niyato;Xuemin Shen

IEEE Transactions on Wireless Communications, vol. 24, no. 1, pp. 643-658. https://ieeexplore.ieee.org/document/10759588/

Citations: 0

Abstract



Generative Artificial Intelligence (GAI) is revolutionizing the world with its unprecedented content creation ability. The Large Language Model (LLM) is one of its most embraced branches. However, due to their substantial size and resource-intensive nature, LLMs are cloud-hosted, raising concerns about privacy, usage limitations, and latency. In this paper, we propose to utilize ubiquitous distributed wireless edge computing resources for real-time LLM inference. Specifically, we introduce a novel LLM edge inference framework, incorporating batching and model quantization to ensure high-throughput inference on resource-limited edge devices. Then, based on the architecture of transformer decoder-based LLMs, we formulate an edge inference optimization problem, which is NP-hard, considering batch scheduling and joint allocation of communication and computation resources. The objective is to maximize throughput under edge resource constraints and heterogeneous user requirements on latency and accuracy. To solve this NP-hard problem, we develop an OT-GAH (Optimal Tree-search with Generalized Assignment Heuristics) algorithm with reasonable complexity and a $\frac{1}{2}$-approximation ratio. We first design the OT algorithm with online tree-pruning for the single-edge-node multi-user case, which navigates the inference request selection within the tree structure to maximize throughput. We then consider the multi-edge-node case and propose the GAH algorithm, which recursively invokes the OT in each node's inference scheduling iteration. Simulation results demonstrate the superiority of OT-GAH batching over other benchmarks, with a time-complexity reduction of more than 45% compared with brute-force search.
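The abstract gives no pseudocode, so the following is only a toy sketch of the general idea it describes: a pruned tree search that selects a batch of requests on one node, then a greedy loop that reuses that search across several nodes. It is not the paper's OT or GAH algorithm and carries no approximation guarantee. The `Request` fields, the feasibility rule (total memory within a cap, and the batch's summed compute time within the tightest deadline in the batch), and the function names `best_batch` and `assign_to_nodes` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Request:
    """Hypothetical inference request; all fields are illustrative assumptions."""
    name: str
    memory: float    # memory the request adds to a batch
    compute: float   # compute time the request adds to a batch
    deadline: float  # latency requirement of the user


def feasible(batch: List[Request], memory_cap: float) -> bool:
    """Assumed rule: the batch fits in memory and finishes before its tightest deadline."""
    if not batch:
        return True
    return (sum(r.memory for r in batch) <= memory_cap
            and sum(r.compute for r in batch) <= min(r.deadline for r in batch))


def best_batch(requests: List[Request], memory_cap: float) -> List[Request]:
    """Depth-first include/exclude search for the largest feasible batch."""
    best: List[Request] = []

    def search(i: int, chosen: List[Request]) -> None:
        nonlocal best
        # Prune: even taking every remaining request cannot beat the best so far.
        if len(chosen) + (len(requests) - i) <= len(best):
            return
        if i == len(requests):
            best = list(chosen)
            return
        candidate = chosen + [requests[i]]
        # Adding requests never repairs an infeasible batch, so skip such branches.
        if feasible(candidate, memory_cap):
            search(i + 1, candidate)
        search(i + 1, chosen)

    search(0, [])
    return best


def assign_to_nodes(requests: List[Request],
                    node_caps: Dict[str, float]) -> Dict[str, List[Request]]:
    """Greedy multi-node pass: each node in turn batches the still-unserved requests."""
    remaining = list(requests)
    plan: Dict[str, List[Request]] = {}
    for node, cap in node_caps.items():
        batch = best_batch(remaining, cap)
        plan[node] = batch
        served = {id(r) for r in batch}
        remaining = [r for r in remaining if id(r) not in served]
    return plan


if __name__ == "__main__":
    demo = [
        Request("a", memory=4, compute=1.0, deadline=5.0),
        Request("b", memory=3, compute=2.0, deadline=4.0),
        Request("c", memory=5, compute=1.5, deadline=2.0),
        Request("d", memory=2, compute=0.5, deadline=6.0),
    ]
    plan = assign_to_nodes(demo, {"edge-1": 10, "edge-2": 10})
    for node, batch in plan.items():
        print(node, [r.name for r in batch])
```

The pruning test is the standard branch-and-bound bound for a count objective. It is valid here because, under the assumed rule, adding a request can only raise the resource sums and lower the tightest deadline. A real scheduler would also have to model quantization levels, accuracy requirements, and wireless bandwidth, which this sketch leaves out.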


Source journal

IEEE Transactions on Wireless Communications (Engineering & Technology: Telecommunications)

CiteScore: 18.60

Self-citation rate: 10.60%

Articles published: 708

Review time: 5.6 months

About the journal: The IEEE Transactions on Wireless Communications is a prestigious publication that showcases cutting-edge advancements in wireless communications. It welcomes both theoretical and practical contributions in various areas. The scope of the Transactions encompasses a wide range of topics, including modulation and coding, detection and estimation, propagation and channel characterization, and diversity techniques. The journal also emphasizes the physical and link layer communication aspects of network architectures and protocols. The journal is open to papers on specific topics or non-traditional topics related to specific application areas. This includes simulation tools and methodologies, orthogonal frequency division multiplexing, MIMO systems, and wireless over optical technologies. Overall, the IEEE Transactions on Wireless Communications serves as a platform for high-quality manuscripts that push the boundaries of wireless communications and contribute to advancements in the field.