Adaptive Kernel Merge and Fusion for Multi-Tenant Inference in Embedded GPUs

IF 1.7 | Zone 4 (Computer Science) | JCR Q3 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE) | IEEE Embedded Systems Letters | Pub Date: 2024-01-09 | DOI: 10.1109/LES.2024.3351753
Jaebeom Jeon;Gunjae Koo;Myung Kuk Yoon;Yunho Oh
IEEE Embedded Systems Letters, vol. 16, no. 4, pp. 421-424. Journal Article. https://ieeexplore.ieee.org/document/10384636/
Citation count: 0

Abstract

This letter proposes a scheme that improves throughput and reduces queuing delay when running multiple inferences on embedded graphics processing unit (GPU)-based systems. We observe that an embedded system runs inference with a fixed number of deep learning models and that inference requests often use the same model. Unlike prior work that proposed kernel fusion or scheduling techniques, this letter proposes a software technique that merges and fuses kernels by monitoring the requests in a queue. The technique first monitors a fixed number of requests and groups those that run the same model. It then creates kernels that iteratively process the grouped requests; we call this step kernel merging. Next, it performs kernel fusion on the merged kernels. Ultimately, the approach minimizes the number of concurrent kernels, mitigating stalls caused by frequent context switching in a GPU. In our evaluation, the proposed kernel merge and fusion achieve $2.7\times$ better throughput, 47% shorter average kernel execution time, and 63% shorter tail latency than prior work.
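The grouping and merging steps described in the abstract can be illustrated with a small host-side sketch. This is not the authors' implementation: the `Request` type, the `MODEL_KERNELS` table, the window size, and all function names are hypothetical, and the "kernels" are just labels rather than real GPU launches. The sketch only shows how grouping same-model requests within a fixed window reduces the number of launches; it does not model the subsequent fusion step.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical table: each model is a fixed sequence of kernel names.
MODEL_KERNELS: Dict[str, List[str]] = {
    "resnet": ["conv", "relu", "fc"],
    "mobilenet": ["dwconv", "relu"],
}


@dataclass(frozen=True)
class Request:
    req_id: int
    model: str


def group_window(queue: List[Request], window: int) -> Dict[str, List[Request]]:
    """Inspect the first `window` queued requests and group them by model."""
    groups: Dict[str, List[Request]] = {}
    for req in queue[:window]:
        groups.setdefault(req.model, []).append(req)
    return groups


def merged_launches(groups: Dict[str, List[Request]]) -> List[Tuple[str, str, List[int]]]:
    """Plan one merged launch per (model, kernel); each merged launch
    would iterate over every request in that model's group."""
    plan = []
    for model, reqs in groups.items():
        ids = [r.req_id for r in reqs]
        for kernel in MODEL_KERNELS[model]:
            plan.append((model, kernel, ids))
    return plan


def unmerged_launch_count(queue: List[Request], window: int) -> int:
    """Baseline: every request launches each of its model's kernels separately."""
    return sum(len(MODEL_KERNELS[r.model]) for r in queue[:window])


queue = [
    Request(0, "resnet"),
    Request(1, "mobilenet"),
    Request(2, "resnet"),
    Request(3, "resnet"),
]
groups = group_window(queue, window=4)
plan = merged_launches(groups)

print(unmerged_launch_count(queue, 4))  # 11 separate launches without merging
print(len(plan))                        # 5 merged launches
```

In this toy queue, three of the four requests share a model, so merging cuts the launch count from 11 to 5. The benefit in such a sketch grows with how often requests in the window use the same model, which matches the observation the abstract relies on.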
Source journal
IEEE Embedded Systems Letters (Engineering: Control and Systems Engineering)
CiteScore: 3.30
Self-citation rate: 0.00%
Articles published: 65
Journal description: The IEEE Embedded Systems Letters (ESL) provides a forum for rapid dissemination of the latest technical advances in embedded systems and related areas in embedded software. The emphasis is on models, methods, and tools that ensure secure, correct, efficient, and robust design of embedded systems and their applications.