OpenCL pipes offer a powerful construct for synthesizing multi-kernel FPGA applications with inter-kernel communication dependencies. The communication discipline between the FPGA kernels is restricted to producer-consumer style patterns supported with on-chip FPGA FIFOs. While this imposes few restrictions on usage, the OpenCL compiler is unable to provide guarantees on buffering capacity or schedulability of the connected kernels. Without these guarantees, an OpenCL developer may over-provision hardware resources or assume pessimistic timing during scheduling. We propose imposing a communication discipline inspired by models of computation (e.g., Ptolemy) such as synchronous dataflow (SDF) and bulk synchronous parallel (BSP). These models offer a restricted subset of communication patterns that enable implementation tradeoffs and deliver performance and resource guarantees. This is useful for OpenCL developers operating within the constraints of the FPGA device. We provide a preliminary analysis of our proposal and sketch the programmer and compiler responsibilities that would be needed to integrate these features into the FPGA OpenCL environment.
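To illustrate the kind of guarantee an SDF discipline enables (not code from the paper — a generic sketch of the standard SDF balance-equation analysis): solving the balance equations yields integer firing counts for each kernel, from which a compiler could bound FIFO capacities statically instead of over-provisioning.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetition_vector(edges, n_actors):
    """Solve the SDF balance equations q[src] * prod == q[dst] * cons
    for the smallest positive integer firing counts q.

    edges: list of (src, dst, prod, cons) -- actor `src` produces `prod`
    tokens per firing on a FIFO from which actor `dst` consumes `cons`
    per firing. The graph is assumed connected and rate-consistent.
    """
    q = [None] * n_actors
    q[0] = Fraction(1)
    changed = True
    while changed:                      # propagate relative rates along edges
        changed = False
        for src, dst, prod, cons in edges:
            if q[src] is not None and q[dst] is None:
                q[dst] = q[src] * prod / cons
                changed = True
            elif q[dst] is not None and q[src] is None:
                q[src] = q[dst] * cons / prod
                changed = True
    # scale the fractional rates to the smallest integer vector
    lcm = reduce(lambda a, b: a * b // gcd(a, b),
                 (f.denominator for f in q), 1)
    return [int(f * lcm) for f in q]

# Producer emits 2 tokens per firing; consumer takes 3 per firing:
# the producer must fire 3 times for every 2 consumer firings.
q = repetition_vector([(0, 1, 2, 3)], 2)
```

Given such a vector, each FIFO moves exactly `prod * q[src]` tokens per schedule iteration, which bounds the buffering the hardware must provide.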
{"title":"Applying Models of Computation to OpenCL Pipes for FPGA Computing","authors":"Nachiket Kapre, Hiren D. Patel","doi":"10.1145/3078155.3078163","DOIUrl":"https://doi.org/10.1145/3078155.3078163","url":null,"abstract":"OpenCL pipes offer a powerful construct for synthesizing multi-kernel FPGA applications with inter-kernel communication dependencies. The communication discipline between the FPGA kernels is restricted to producer-consumer style patterns supported with on-chip FPGA FIFOs. While this imposes few restrictions on usage, the OpenCL compiler is unable to provide guarantees on buffering capacity or schedulability of the connected kernels. Without these guarantees, an OpenCL developer may over-provision hardware resources or assume pessimistic timing during scheduling. We propose imposing a communication discipline inspired by models of computation (e.g., Ptolemy) such as synchronous dataflow (SDF) and bulk synchronous parallel (BSP). These models offer a restricted subset of communication patterns that enable implementation tradeoffs and deliver performance and resource guarantees. This is useful for OpenCL developers operating within the constraints of the FPGA device. We provide a preliminary analysis of our proposal and sketch the programmer and compiler responsibilities that would be needed to integrate these features into the FPGA OpenCL environment.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126824407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Task scheduling and memory management are challenges that make Heterogeneous Computing difficult for the masses. Several programming models and tools exist that target workload partitioning and data accessibility between the CPU and GPU. We have developed and deployed the Symphony SDK - a framework that makes workload partitioning, scheduling and memory management 'simple' for developers. In this talk, we will introduce the Symphony architecture and elaborate on how existing OpenCL kernels can be reused with the heterogeneous task synchronization, task scheduling, and memory management capabilities of Symphony. We will also share real-world cases where Symphony has provided 2x-6x performance speed-ups.
{"title":"Symphony: Task Scheduling and Memory Management in Heterogeneous Computing","authors":"Amit Jindal, Wenjia Ruan","doi":"10.1145/3078155.3078171","DOIUrl":"https://doi.org/10.1145/3078155.3078171","url":null,"abstract":"Task scheduling and memory management are challenges that make Heterogeneous Computing difficult for the masses. There are several programming models and tools that exist targeting partitioning of workload and accessibility of data between CPU and GPU. We have developed and deployed Symphony SDK - a framework that makes workload partitioning, scheduling and memory management 'simple' for developers. In this talk, we will introduce Symphony architecture, elaborate how existing OpenCL kernels can be reused with heterogeneous task synchronization, task scheduling, and memory management capabilities of Symphony. We will also share real-world cases where Symphony has provided 2x-6x performance speed-ups.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"461 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125810096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The increasing uptake of portable, parallel programming models such as OpenCL has fueled extensive research into performance portability. Automatic performance tuning techniques have shown promise for generating kernels which are highly optimized for specific architectures, but they do not address the issue of performance portability directly. With the range of architectures and possible optimizations continuously growing, the concept of achieving performance portability from a single code base becomes ever more attractive. In this talk, we present an approach for analyzing performance portability that exploits the black-box nature of automatic performance tuning techniques. We demonstrate this approach across a diverse range of GPU and CPU architectures for two simple OpenCL applications. We then discuss the potential for auto-tuning to aid the generation of performance-portable OpenCL kernels by incorporating multi-objective optimization techniques into the tuning process.
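The black-box view the talk relies on can be sketched in a few lines (an illustrative sketch, not the authors' tool): the tuner never inspects the kernel, it only observes a runtime for each candidate configuration. The cost model and parameter names below are invented for the example.

```python
from itertools import product

def autotune(benchmark, param_space):
    """Treat the kernel as a black box: time every configuration in the
    search space and keep the fastest. Real auto-tuners sample or prune
    this space; exhaustive search keeps the sketch deterministic."""
    names = list(param_space)
    best_cfg, best_time = None, float("inf")
    for values in product(*(param_space[n] for n in names)):
        cfg = dict(zip(names, values))
        t = benchmark(cfg)              # e.g. compile and time an OpenCL kernel
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

# Stand-in cost model: pretend runtime is minimized at work-group size 64
# with no vectorization (purely illustrative numbers, not measurements).
mock_runtime = lambda cfg: abs(cfg["wg_size"] - 64) + 10 * (cfg["vec_width"] - 1)
space = {"wg_size": [16, 32, 64, 128], "vec_width": [1, 2, 4]}
best, t = autotune(mock_runtime, space)
```

Extending this toward the multi-objective setting the talk mentions would mean returning a Pareto set over, say, runtime per architecture, rather than a single fastest configuration.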
{"title":"Analyzing and improving performance portability of OpenCL applications via auto-tuning","authors":"J. Price, Simon McIntosh-Smith","doi":"10.1145/3078155.3078173","DOIUrl":"https://doi.org/10.1145/3078155.3078173","url":null,"abstract":"The increasing uptake of portable, parallel programming models such as OpenCL has fueled extensive research into performance portability. Automatic performance tuning techniques have shown promise for generating kernels which are highly optimized for specific architectures, but do not address the issue of performance portability directly. With the range of architectures and possible optimizations continuously growing, the concept of achieving performance portability from a single code base becomes ever more attractive. In this talk, we present an approach for analyzing performance portability that exploits the black-box nature of automatic performance tuning techniques. We demonstrate this approach across a diverse range of GPU and CPU architectures for two simple OpenCL applications. We then discuss the potential for auto-tuning to aid the generation of performance portable OpenCL kernels by incorporating multi-objective optimization techniques into the tuning process.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125914429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present our experiences in designing, implementing and evaluating efficient applications of the wavefront pattern for block-level motion estimation in video encoding algorithms using OpenCL™ kernels on Intel® Processor Graphics™. We implement multiple solutions exploring different performance considerations, evaluate their pros and cons, present performance data, and provide our recommendations.
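The wavefront pattern the paper applies can be made concrete with a small scheduling sketch (illustrative, not the authors' implementation): in block-level motion estimation each block depends on its left and top neighbours, so blocks are processed anti-diagonal by anti-diagonal, and all blocks on one anti-diagonal can run in parallel.

```python
def wavefront_order(rows, cols):
    """Yield block coordinates anti-diagonal by anti-diagonal, so each
    block (r, c) is emitted only after its left (r, c-1) and top (r-1, c)
    neighbours -- the dependency pattern of block-level motion estimation.
    Blocks on the same anti-diagonal are mutually independent, so a GPU
    could launch each diagonal as one batch of parallel work-items."""
    for d in range(rows + cols - 1):
        for r in range(max(0, d - cols + 1), min(rows, d + 1)):
            yield r, d - r

# 2x3 grid of blocks: diagonals are
# (0,0) | (0,1),(1,0) | (0,2),(1,1) | (1,2)
order = list(wavefront_order(2, 3))
```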
{"title":"Wavefront Parallel Processing on GPUs with an Application to Video Encoding Algorithms","authors":"Biju George, Ben Ashbaugh","doi":"10.1145/3078155.3078177","DOIUrl":"https://doi.org/10.1145/3078155.3078177","url":null,"abstract":"In this paper, we present our experiences in designing, implementing and evaluating efficient applications of the wavefront pattern for block-level motion estimation in video encoding algorithms using OpenCL™ kernels on Intel® Processor Graphics™. We implement multiple solutions exploring different performance considerations, evaluate their pros and cons, present performance data, and provide our recommendations.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115064181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The acceptance and success of cloud computing has given application developers access to computing and new customers at a scale never seen before. The inherent ability of an FPGA to reconfigure and be workload optimized is a great advantage given the fast-moving needs of cloud computing applications. In this talk we will discuss how users can develop, accelerate and deploy accelerated applications in the cloud at scale. You will learn how to get started on a turn-key OpenCL development environment in the cloud using Xilinx FPGAs.
{"title":"Accelerating Applications at Cloud Scale using FPGAs","authors":"Spenser Gilliland","doi":"10.1145/3078155.3078179","DOIUrl":"https://doi.org/10.1145/3078155.3078179","url":null,"abstract":"The acceptance and success of cloud computing has given application developers access to computing and new customers at a scale never seen before. The inherent ability of an FPGA to reconfigure and be workload optimized is a great advantage given the fast-moving needs of cloud computing applications. In this talk we will discuss how users can develop, accelerate and deploy accelerated applications in the cloud at scale. You will learn how to get started on a turn-key OpenCL development environment in the cloud using Xilinx FPGAs.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127341368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SYCL™ is a royalty-free, cross-platform C++ abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL™, while adding the ease-of-use and flexibility of modern C++11/14. For example, SYCL enables single-source development where C++ template functions are compiled for both host and device to construct complex algorithms that use OpenCL acceleration, and then re-use them throughout their source code on different types of data. Using SYCL can simplify development and reduce the amount of code required for applications using OpenCL devices by over 50% compared to standard OpenCL code. This is because of the use of template functions and a simplified, streamlined host API. This hands-on session will provide an opportunity to get experience with SYCL using ComputeCpp™ Community Edition, a free-to-use implementation of the SYCL 1.2 standard. Attendees will be shown how to set up ComputeCpp and use it to write their own SYCL code to run on supported GPUs and CPUs.
{"title":"Heterogeneous Computing Using Modern C++ with OpenCL Devices: Tutorial at IWOCL 2017","authors":"Rod Burns, Ruymán Reyes","doi":"10.1145/3078155.3078159","DOIUrl":"https://doi.org/10.1145/3078155.3078159","url":null,"abstract":"SYCL™ is a royalty-free, cross-platform C++ abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL™, while adding the ease-of-use and flexibility of modern C++11/14. For example, SYCL enables single source development where C++ template functions are compiled for both host and device to construct complex algorithms that use OpenCL acceleration, and then re-use them throughout their source code on different types of data. Using SYCL can simplify development and reduce the amount of code required for applications using OpenCL devices by over 50% compared to standard OpenCL code. This is because of the use of template functions and a simplified, streamlined host API. This hands-on session will provide an opportunity to get experience with SYCL using ComputeCpp™ Community Edition, a free to use implementation of the SYCL 1.2 standard. Attendees will be shown how to set up ComputeCpp and use it to write their own SYCL code to run on supported GPUs and CPUs.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127296047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this tutorial, we will introduce you to the reconfigurable hardware architecture and programming of Field Programmable Gate Arrays (FPGAs). You will learn why FPGAs have become so popular in recent years, and understand the many advantages of using FPGAs in your HPC application. In particular, we will cover architectural features of FPGAs that make them well suited to many complex operations, including matrix multiplications and convolutions. In addition, we will introduce you to programming FPGAs using the Intel FPGA SDK for OpenCL™, and how specific OpenCL coding techniques can lead to efficient circuits implemented on the FPGA. Finally, we will go over several case studies where FPGAs have shown very competitive performance when programmed using OpenCL, including convolutional neural nets, FFTs, and astronomy de-dispersion algorithms.
{"title":"Harnessing the Power of FPGAs with the Intel FPGA SDK for OpenCL™","authors":"Byron Sinclair, A. Ling, Genady Paikin","doi":"10.1145/3078155.3078168","DOIUrl":"https://doi.org/10.1145/3078155.3078168","url":null,"abstract":"In this tutorial, we will introduce you to the reconfigurable hardware architecture and programming of Field Programmable Gate Arrays (FPGAs). You will learn why FPGAs have become so popular in recent years, and understand the many advantages of using FPGAs in your HPC application. In particular, we will cover architectural features of FPGAs that make them well suited to many complex operations, including matrix multiplications and convolutions. In addition, we will introduce you to programming FPGAs using the Intel FPGA SDK for OpenCL™, and how specific OpenCL coding techniques can lead to efficient circuits implemented on the FPGA. Finally, we will go over several case studies where FPGAs have shown very competitive performance when programmed using OpenCL, including convolutional neural nets, FFTs, and astronomy de-dispersion algorithms.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130725895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The recent advancements in High Performance Computing and ongoing research to reach Exascale have been heavily supported by the introduction of dedicated massively parallel accelerators. Programmers wishing to maximize utilization of current supercomputers must develop software that not only scales across multiple nodes but is also capable of offloading data-parallel computation to dedicated hardware such as graphics processors. The introduction of new types of hardware has been followed by the development of new languages, extensions, compilers and libraries. Unfortunately, none of those solutions seems to be fully portable and independent of the specific vendor and type of hardware. HPX.Compute, a programming model developed on top of HPX, a C++ standard library for concurrency and parallelism, uses existing and proposed C++ language and library capabilities to support various types of parallelism. It aims to provide a generic interface that allows writing code which is portable between hardware architectures. We have implemented a new backend for HPX.Compute based on SYCL, a Khronos standard for single-source programming of OpenCL devices in C++. We present how this runtime may be used to target OpenCL devices through our C++ API. We have evaluated the performance of the new implementation on graphics processors with the STREAM benchmark and compared the results with an existing CUDA-based implementation.
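For readers unfamiliar with the benchmark used in the evaluation: STREAM measures sustained memory bandwidth with simple vector kernels, reporting the best of several timed repetitions. A sketch of its "triad" kernel (pure Python for illustration only; CPython is nowhere near hardware bandwidth, and the real benchmark is C/Fortran):

```python
import array
import time

def stream_triad(n, scalar=3.0, reps=5):
    """The STREAM 'triad' kernel, a[i] = b[i] + scalar * c[i], timed over
    several repetitions with the best run kept, as the STREAM rules do.
    Returns the result array and an effective bandwidth in GB/s."""
    b = array.array("d", [1.0] * n)
    c = array.array("d", [2.0] * n)
    a = array.array("d", [0.0] * n)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        for i in range(n):
            a[i] = b[i] + scalar * c[i]
        best = min(best, time.perf_counter() - t0)
    gbytes = 3 * 8 * n / 1e9        # bytes moved per rep: read b, read c, write a
    return a, gbytes / best
```

Ports of this kernel to SYCL (or CUDA) make bandwidth comparisons between runtimes straightforward, because the bytes-moved accounting is fixed by the kernel itself.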
{"title":"Using SYCL as an Implementation Framework for HPX.Compute","authors":"Marcin Copik, Hartmut Kaiser","doi":"10.1145/3078155.3078187","DOIUrl":"https://doi.org/10.1145/3078155.3078187","url":null,"abstract":"The recent advancements in High Performance Computing and ongoing research to reach Exascale have been heavily supported by the introduction of dedicated massively parallel accelerators. Programmers wishing to maximize utilization of current supercomputers must develop software that not only scales across multiple nodes but is also capable of offloading data-parallel computation to dedicated hardware such as graphics processors. The introduction of new types of hardware has been followed by the development of new languages, extensions, compilers and libraries. Unfortunately, none of those solutions seems to be fully portable and independent of the specific vendor and type of hardware. HPX.Compute, a programming model developed on top of HPX, a C++ standard library for concurrency and parallelism, uses existing and proposed C++ language and library capabilities to support various types of parallelism. It aims to provide a generic interface that allows writing code which is portable between hardware architectures. We have implemented a new backend for HPX.Compute based on SYCL, a Khronos standard for single-source programming of OpenCL devices in C++. We present how this runtime may be used to target OpenCL devices through our C++ API. We have evaluated the performance of the new implementation on graphics processors with the STREAM benchmark and compared the results with an existing CUDA-based implementation.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126786219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In our work developing an OpenCL platform for FPGAs, we observed that the way that OpenCL is currently used on FPGAs does not expose the full capability of FPGAs to the programmer. In particular, FPGAs are spatial devices that can be partitioned by area with each partition programmed with a different function. The latest FPGAs can even be reconfigured dynamically such that one partition of the FPGA can be configured while the rest of the FPGA is still in use. The analogy with GPUs is that an OpenCL programmer can partition a GPU into multiple device objects, execute different kernels on each device object, and reprogram the device objects. An OpenCL programmer cannot do this with an FPGA even though the capability exists. As FPGA capacities continue to increase, the ability to partition and partially reconfigure the FPGA will become even more desirable. The fundamental issue is how FPGAs are currently viewed as devices in the OpenCL model. In this paper, we propose a small change to the OpenCL definition of a device that unlocks the full potential of FPGAs to the programmer.
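The proposed device model can be pictured with a toy sketch (all names are illustrative; this is not the OpenCL API and not the authors' implementation): each spatial partition behaves like an independent device object that can be reprogrammed while the other partitions keep running.

```python
class Partition:
    """One spatial region of an FPGA, modeled as an independently
    programmable device object (a toy model of the paper's proposal)."""
    def __init__(self, name):
        self.name, self.bitstream = name, None

    def program(self, bitstream):
        # Stands in for partial reconfiguration of this region only.
        self.bitstream = bitstream

    def run(self, *args):
        if self.bitstream is None:
            raise RuntimeError(f"partition {self.name} is not configured")
        return self.bitstream(*args)

class FPGA:
    """A device exposing its area as several independent partitions."""
    def __init__(self, n):
        self.partitions = [Partition(f"p{i}") for i in range(n)]

fpga = FPGA(2)
fpga.partitions[0].program(lambda x: x * 2)   # region 0: one kernel
fpga.partitions[1].program(lambda x: x + 1)   # region 1: a different kernel
result = fpga.partitions[0].run(21)           # region 0 keeps executing...
fpga.partitions[1].program(lambda x: x - 1)   # ...while region 1 is reprogrammed
```

The GPU analogy in the abstract maps onto this directly: each `Partition` plays the role of one of the multiple device objects a programmer can already create on a GPU.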
{"title":"Enabling FPGAs as a True Device in the OpenCL Standard: Bridging the Gap for FPGAs","authors":"Vincent Mirian, P. Chow","doi":"10.1145/3078155.3078176","DOIUrl":"https://doi.org/10.1145/3078155.3078176","url":null,"abstract":"In our work with developing an OpenCL platform for FPGAs, we observed that the way that OpenCL is currently used on FPGAs does not expose the full capability of FPGAs to the programmer. In particular, FPGAs are spatial devices that can be partitioned by area with each partition programmed with a different function. The latest FPGAs can even be reconfigured dynamically such that one partition of the FPGA can be configured while the rest of the FPGA is still in use. The analogy with GPUs is that an OpenCL programmer can partition a GPU into multiple device objects, execute different kernels on each device object, and reprogram the device objects. An OpenCL programmer cannot do this with an FPGA even though the capability exists. As FPGA capacities continue to increase, the ability to partition and partially reconfigure the FPGA will become even more desirable. The fundamental issue is how FPGAs are currently viewed as devices in the OpenCL model. In this paper, we propose a small change to the OpenCL definition of a device that unlocks the full potential of FPGAs to the programmer.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123306841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine learning is being used in more and more artificial intelligence applications. While existing machine learning frameworks mostly support NVIDIA CUDA GPUs, there has been little research dedicated to targeting other devices through open standards such as OpenCL. In this paper, we explain how machine learning applications can harness the power of OpenCL using open standards and how, by using SYCL, TensorFlow can be extended to include customized operations running on OpenCL devices.
{"title":"Accelerated Machine Learning Using TensorFlow and SYCL on OpenCL Devices","authors":"M. Goli, L. Iwanski, A. Richards","doi":"10.1145/3078155.3078160","DOIUrl":"https://doi.org/10.1145/3078155.3078160","url":null,"abstract":"Machine learning is being used in more and more artificial intelligence applications. While existing machine learning frameworks mostly support NVIDIA CUDA GPUs, there has been little research dedicated to targeting other devices through open standards such as OpenCL. In this paper, we explain how machine learning applications can harness the power of OpenCL using open standards and how, by using SYCL, TensorFlow can be extended to include customized operations running on OpenCL devices.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117173660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}