PEP: proactive checkpointing for efficient preemption on GPUs

Chen Li, Andrew Zigerelli, Jun Yang, Yang Guo
{"title":"PEP: proactive checkpointing for efficient preemption on GPUs","authors":"Chen Li, Andrew Zigerelli, Jun Yang, Yang Guo","doi":"10.1109/DAC.2018.8465929","DOIUrl":null,"url":null,"abstract":"The demand for multitasking GPUs increases whenever the GPU may be shared by multiple applications, either spatially or temporally. This requires that GPUs can be preempted and switch context to a new application while already executing one. Unlike CPUs, context switching in GPUs is prohibitively expensive due to the large context states to swap out. There have been a number of efforts on reducing the overhead of preemption, through reducing the context sizes or overlapping context switching with execution. All those techniques are reactive approaches, meaning that context switching occurs when the preemption request arrives.In this paper, we propose a proactive mechanism to reduce the latency of preemption. We observe that kernel execution is almost always preceded by known commands in both CUDA and OpenCL implementations. Hence, a preemption can be anticipated before the actual request arrives. We study such lead time and develop a prediction scheme to perform an early state saving. When the actual preemption is invoked, an incremental update relative to the previous saved state is performed, much like the conventional checkpointing mechanism. This design effectively reduces the stall time of the preempting kernel due to context switching by 58.6%. Moreover, through careful handling of the saved state, we can also reduce the overall size of saved state by an average of 23.3%, compared with a full context switching.","PeriodicalId":87346,"journal":{"name":"Proceedings. Design Automation Conference","volume":"21 1","pages":"114:1-114:6"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. Design Automation Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DAC.2018.8465929","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The demand for multitasking GPUs increases whenever the GPU may be shared by multiple applications, either spatially or temporally. This requires that a GPU can be preempted and switch context to a new application while already executing one. Unlike on CPUs, context switching on GPUs is prohibitively expensive due to the large context state that must be swapped out. There have been a number of efforts to reduce the overhead of preemption, either by reducing the context size or by overlapping context switching with execution. All of those techniques are reactive approaches, meaning that context switching begins only when the preemption request arrives. In this paper, we propose a proactive mechanism to reduce the latency of preemption. We observe that kernel execution is almost always preceded by known commands in both CUDA and OpenCL implementations. Hence, a preemption can be anticipated before the actual request arrives. We study this lead time and develop a prediction scheme that performs an early state save. When the actual preemption is invoked, an incremental update relative to the previously saved state is performed, much like a conventional checkpointing mechanism. This design reduces the stall time of the preempting kernel due to context switching by 58.6%. Moreover, through careful handling of the saved state, we also reduce the overall size of the saved state by an average of 23.3% compared with a full context switch.
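To make the two-phase save concrete, below is a minimal host-side sketch of the idea. It is an illustration, not the authors' implementation: the class and command names (ProactiveCheckpointer, Cmd), the fixed region granularity, and the dirty-bit tracking are all hypothetical stand-ins for the hardware/runtime mechanism the paper describes. The sketch assumes that setup commands (host-to-device copies, kernel-argument setup) from another context predict an imminent launch, triggering a full early save; the actual preemption request then copies only the regions dirtied since that save.

```cpp
#include <bitset>
#include <cstdint>
#include <cstdio>
#include <cstring>

constexpr std::size_t kRegions    = 64;    // context split into trackable regions (illustrative)
constexpr std::size_t kRegionSize = 4096;  // bytes per region (illustrative)

// In-flight context of the currently running kernel.
struct GpuContext {
    uint8_t state[kRegions][kRegionSize];  // register file / shared-memory image
    std::bitset<kRegions> dirty;           // regions written since the last save
};

// Off-chip checkpoint of that context.
struct Checkpoint {
    uint8_t saved[kRegions][kRegionSize];
    bool valid = false;
};

// Commands seen on the host-to-GPU command stream. In both CUDA and OpenCL,
// a kernel launch is almost always preceded by setup commands such as these.
enum class Cmd { MemcpyH2D, SetKernelArgs, LaunchKernel, PreemptRequest };

class ProactiveCheckpointer {
public:
    // Invoked for each command that belongs to a context other than the one
    // currently running on the GPU.
    void onCommand(Cmd cmd, GpuContext& running, Checkpoint& ckpt) {
        switch (cmd) {
        case Cmd::MemcpyH2D:
        case Cmd::SetKernelArgs:
            // Lead time: a launch (and hence a preemption) is likely imminent,
            // so save the full context off the critical path.
            earlySave(running, ckpt);
            break;
        case Cmd::PreemptRequest:
            // Critical path: copy only what changed since the early save.
            incrementalSave(running, ckpt);
            break;
        default:
            break;
        }
    }

private:
    void earlySave(GpuContext& ctx, Checkpoint& ckpt) {
        std::memcpy(ckpt.saved, ctx.state, sizeof(ctx.state));
        ctx.dirty.reset();  // this snapshot becomes the baseline
        ckpt.valid = true;
    }

    void incrementalSave(GpuContext& ctx, Checkpoint& ckpt) {
        if (!ckpt.valid) {         // misprediction: no early save happened,
            earlySave(ctx, ckpt);  // fall back to a full (reactive) save
            return;
        }
        std::size_t copied = 0;
        for (std::size_t r = 0; r < kRegions; ++r) {
            if (ctx.dirty.test(r)) {  // only regions written after the early save
                std::memcpy(ckpt.saved[r], ctx.state[r], kRegionSize);
                ++copied;
            }
        }
        std::printf("incremental save: %zu of %zu regions copied\n", copied, kRegions);
    }
};
```

Because only the dirty regions are copied while the preempting kernel waits, the bulk of the context movement is hidden in the lead time; this is the mechanism behind the reported 58.6% reduction in stall time, and tracking which state is still live is what enables the 23.3% smaller saved footprint.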