How to optimize Compute Drivers? Let’s start with writing good benchmarks!
Michał Mrozek
International Workshop on OpenCL (IWOCL)
DOI: 10.1145/3529538.3529569
Published: 2022-05-10
Citations: 0
Abstract
Writing an efficient driver stack is the goal of every driver developer, but to verify that your stack is performant, you need tools that can confirm it. You may run workloads and benchmarks to see how your driver performs, but these only give you a summarized score composed of many pieces. Optimizing further requires extensive work: understanding the applications, finding the bottleneck, and removing it is a time-consuming process involving a lot of effort. This created a need for the driver team to write a tool that would make performance work on the driver easier, so we created compute benchmarks. In this suite we test all aspects of the driver stack to check that they are free of bottlenecks. Each test checks only one thing, in isolation, so it is very easy to optimize and requires no extensive setup. The benchmarks focus on subtle aspects of every driver, such as the API overhead of every call, submission latencies, resource creation costs, transfer bandwidths, multi-threaded contention, multi-process execution, and many others. The framework supports multiple backends; we currently have OpenCL and Level Zero implementations in place, so it is very easy to compare how the same scenario is serviced by different drivers. It is also very easy to compare driver implementations between vendors, as tests written in OpenCL simply work across different GPU implementations. We also use this code to present good and bad coding practices; this is very useful for showcasing how simple changes can drastically improve performance, and users can run those scenarios to see how performance changes on their own setups. It is also a great tool for prototyping new extensions and then proposing them as part of the OpenCL standard. We plan to open-source this project in Q2 2022; it is expected to be available by IWOCL.