Medical image processing and visualization is often computationally demanding. Ultrasound images are acquired in real-time and needs to be processed at a high framerate with low latency. Computed tomography (CT) and magnetic resonance imaging (MRI) create large three dimensional volumes with sizes up to 512 × 512 × 800 voxels. In digital pathology, whole slide microscopy images can have an extreme image size of up to 200, 000 × 100, 000 pixels, which does not even fit into the memory of most computers. Thus, there is a need for smart data storage, processing and visualization methods to handle medical image data. The development of FAST started in 2014, the goal was to create an open-source framework which made GPU and parallel processing of medical images easy and portable. While there existed popular image processing libraries such as the visualization toolkit (VTK), insight toolkit (ITK) and OpenCV, the GPU processing capabilities were still implemented ad-hoc and often implied copying data back and forth from the GPU and CPU. Thus it was decided to use the new OpenCL API to create a cross-platform framework designed bottom-up with GPU processing at the very core. One of the design goals was to remove the burden of moving data back and forth from different processors and memory spaces from the developer. Instead, the developer requests access to the data on a given processor, and FAST will copy and update data as needed. Now, seven years later FAST version 3.2 is released, it still uses OpenCL 1.2 and OpenGL 3.3 at the core of almost all of its operations. FAST can stream images in real-time from ultrasound scanners, webcameras, Intel’s RealSense depth camera, and read many different formats from disk including medical formats such as DICOM, Metaimage and huge microscopy images stored as tiled image pyramids. FAST uses a processing pipeline concept, meaning that you define a pipeline as multiple processing and visualization steps first, then initiate the processing by executing the pipeline. The advantages of this is that it’s easy to change data sources and processing steps. The same pipeline used to process an ultrasound image on disk, can be used to process a real-time stream of ultrasound images. Today FAST pipelines can be created with C++, Python 3 and even without any programming using simple text files. The pipeline approach also opens up possibilities for load balancing and tuning based on analyzing the pipeline as computational graphs, although this has not yet been implemented. In the last five years or so, deep neural networks have become the standard for almost all image processing tasks. Many high-performance frameworks for deep neural network inference already exist, but have very different APIs and use different formats for storing neural network models. FAST now provides a common API for neural networks with multiple backends such as NVIDIA’s TensorRT, Intel’s OpenVINO and Google’s TensorFlow. This removes the burden of the us
{"title":"FAST: A framework for high-performance medical image computing and visualization","authors":"E. Smistad","doi":"10.1145/3456669.3456717","DOIUrl":"https://doi.org/10.1145/3456669.3456717","url":null,"abstract":"Medical image processing and visualization is often computationally demanding. Ultrasound images are acquired in real-time and needs to be processed at a high framerate with low latency. Computed tomography (CT) and magnetic resonance imaging (MRI) create large three dimensional volumes with sizes up to 512 × 512 × 800 voxels. In digital pathology, whole slide microscopy images can have an extreme image size of up to 200, 000 × 100, 000 pixels, which does not even fit into the memory of most computers. Thus, there is a need for smart data storage, processing and visualization methods to handle medical image data. The development of FAST started in 2014, the goal was to create an open-source framework which made GPU and parallel processing of medical images easy and portable. While there existed popular image processing libraries such as the visualization toolkit (VTK), insight toolkit (ITK) and OpenCV, the GPU processing capabilities were still implemented ad-hoc and often implied copying data back and forth from the GPU and CPU. Thus it was decided to use the new OpenCL API to create a cross-platform framework designed bottom-up with GPU processing at the very core. One of the design goals was to remove the burden of moving data back and forth from different processors and memory spaces from the developer. Instead, the developer requests access to the data on a given processor, and FAST will copy and update data as needed. Now, seven years later FAST version 3.2 is released, it still uses OpenCL 1.2 and OpenGL 3.3 at the core of almost all of its operations. FAST can stream images in real-time from ultrasound scanners, webcameras, Intel’s RealSense depth camera, and read many different formats from disk including medical formats such as DICOM, Metaimage and huge microscopy images stored as tiled image pyramids. FAST uses a processing pipeline concept, meaning that you define a pipeline as multiple processing and visualization steps first, then initiate the processing by executing the pipeline. The advantages of this is that it’s easy to change data sources and processing steps. The same pipeline used to process an ultrasound image on disk, can be used to process a real-time stream of ultrasound images. Today FAST pipelines can be created with C++, Python 3 and even without any programming using simple text files. The pipeline approach also opens up possibilities for load balancing and tuning based on analyzing the pipeline as computational graphs, although this has not yet been implemented. In the last five years or so, deep neural networks have become the standard for almost all image processing tasks. Many high-performance frameworks for deep neural network inference already exist, but have very different APIs and use different formats for storing neural network models. FAST now provides a common API for neural networks with multiple backends such as NVIDIA’s TensorRT, Intel’s OpenVINO and Google’s TensorFlow. This removes the burden of the us","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89270950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To support full functionality of two compile-time extensions were added that are safe for (only used in metaprogramming) no extra functionality is needed on conformant OpenCL devices [4]. The extensions are required for: Specifying pointers to functions in is_member_function_pointer; Specifying variadic prototypes in result_of, invoke_result, is_invocable, is_nothrow_invocable, is_member_function_pointer. clang -cl-std=clc++ -I/include -DN=10 test.cl
{"title":"Experimenting with C++ libraries in OpenCL kernel code","authors":"Ole Strohm, Anastasia Stulova","doi":"10.1145/3456669.3456675","DOIUrl":"https://doi.org/10.1145/3456669.3456675","url":null,"abstract":"To support full functionality of <type_traits> two compile-time extensions were added that are safe for <type_traits> (only used in metaprogramming) no extra functionality is needed on conformant OpenCL devices [4]. The extensions are required for: Specifying pointers to functions in is_member_function_pointer; Specifying variadic prototypes in result_of, invoke_result, is_invocable, is_nothrow_invocable, is_member_function_pointer. clang -cl-std=clc++ -I<path to libcxx>/include -DN=10 test.cl","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87362824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Open standards are being looked at as an attractive alternative to proprietary solutions by the automotive domain to enable sensor fusion systems in cheap mass-market vehicles. Open standards specification for SYCL, OpenCL and Vulkan were not always designed with safety in mind, yet they could be at the centre of tomorrows highly critical systems in a vehicle.
{"title":"Can SYCL and OpenCL meet the challenges of functional safety?","authors":"Rod Burns, Illya Rudkin","doi":"10.1145/3456669.3456688","DOIUrl":"https://doi.org/10.1145/3456669.3456688","url":null,"abstract":"Open standards are being looked at as an attractive alternative to proprietary solutions by the automotive domain to enable sensor fusion systems in cheap mass-market vehicles. Open standards specification for SYCL, OpenCL and Vulkan were not always designed with safety in mind, yet they could be at the centre of tomorrows highly critical systems in a vehicle.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88842852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rod Burns, I. Vorobtsov, Aksel Alpay, R. Keryell, Michael Steyer, Gavin Brown
SYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to a platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++. In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language. This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises. Hence, attendees will require their own laptop to perform the hands-on exercises. Topics Covered Include:
{"title":"A Hands-On Introduction To SYCL","authors":"Rod Burns, I. Vorobtsov, Aksel Alpay, R. Keryell, Michael Steyer, Gavin Brown","doi":"10.1145/3456669.3456682","DOIUrl":"https://doi.org/10.1145/3456669.3456682","url":null,"abstract":"SYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to a platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++. In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language. This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises. Hence, attendees will require their own laptop to perform the hands-on exercises. Topics Covered Include:","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77973354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Siva Rama Krishna Reddy, Hongqiang Wang, Adarsh Golikeri, Alex Bourd
{"title":"Machine learning training with Tensor Virtual Machine (TVM) and Adreno GPUs","authors":"Siva Rama Krishna Reddy, Hongqiang Wang, Adarsh Golikeri, Alex Bourd","doi":"10.1145/3456669.3456702","DOIUrl":"https://doi.org/10.1145/3456669.3456702","url":null,"abstract":"","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"29 12","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72599481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous computing has emerged as an important method for supporting more than one kind of processors or accelerators in a program. The Khronos SYCL [3] standard defines an abstract programming model for heterogeneous computing. The oneAPI Specification [10] and at its core the DPC++ programming language [9] are built on top of the SYCL standards. In this presentation, we will be reviewing the implementation steps taken to add the support for the Huawei Ascend AI Chipset to DPC++.
{"title":"Extending DPC++ with Support for Huawei Ascend AI Chipset","authors":"W. Feng, Rasool Maghareh, Kai-Ting Amy Wang","doi":"10.1145/3456669.3456684","DOIUrl":"https://doi.org/10.1145/3456669.3456684","url":null,"abstract":"Heterogeneous computing has emerged as an important method for supporting more than one kind of processors or accelerators in a program. The Khronos SYCL [3] standard defines an abstract programming model for heterogeneous computing. The oneAPI Specification [10] and at its core the DPC++ programming language [9] are built on top of the SYCL standards. In this presentation, we will be reviewing the implementation steps taken to add the support for the Huawei Ascend AI Chipset to DPC++.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"144 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77534171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abhishek Bagusetty, Jinsung Kim, Ajay Panyala, Á. Vázquez-Mayagoitia, K. Kowalski, S. Krishnamoorthy
{"title":"Approaching Coupled Cluster Theory with Perturbative Triples using SYCL","authors":"Abhishek Bagusetty, Jinsung Kim, Ajay Panyala, Á. Vázquez-Mayagoitia, K. Kowalski, S. Krishnamoorthy","doi":"10.1145/3456669.3456700","DOIUrl":"https://doi.org/10.1145/3456669.3456700","url":null,"abstract":"","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78246493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"hipSYCL in 2021: Peculiarities, unique features and SYCL 2020","authors":"Aksel Alpay, V. Heuveline","doi":"10.1145/3456669.3456691","DOIUrl":"https://doi.org/10.1145/3456669.3456691","url":null,"abstract":"","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78687299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yong Wang, Yongfa Zhou, Q. Wang, Wang Yang, Qing Xu, Chen Wang
The Diagnostic ultrasound is a rapidly developing imaging technology that is widely used in the clinic. A typical ultrasound imaging pipeline including the following algorithms: beamforming, Envelope detection, log-compression, and scan-conversion [1]. In tradition, ultrasound imaging is implemented using Application-specific integrated circuits (ASICs) and FPGAs due to its high throughput and massive data processing requirements. With the development of the GPGPU and its programming environments (e.g. CUDA), researchers use software to implement ultrasound imaging algorithms [2], [3]. For now, the two limiting factors of developing ultrasound imaging are: First, using a hardware development approach to implement ultrasound imaging algorithms is complex, time-consuming and lacks flexibility. Second, the existing CUDA-based ultrasound imaging implementations are limited to Nvidia hardware, which is also a restriction applying more architectures. oneAPI is a cross-platform and unified programming environment developed by intel. It enables heterogeneous computing across multiple hardware architectures using Data Parallel C++ (DPC++). This new programming suite can be used to address the problems mentioned above. To be clear, using a high-level language like DPC++ to program FPGA can accelerate ultrasound imaging application development. SYCL-based ultrasound imaging applications can be easily migrated to other vendor's hardware. To implement an ultrasound imaging application across multiple architectures (e.g., GPU, FPGA, and CPU) in a unified programming environment. We migrated a CUDA-based open-source ultrasound imaging project SUPRA [4]. The migration process was performed using oneAPI compatibility tool (e.g. dpct). After migration, the code was tuned to run on GPU, FPGA, and CPU. In this talk, we will discuss our experiences with the complete process of migrating a CUDA code to oneAPI code. First, the whole process of migrating CUDA code base using the dpct will be presented, including usage, code modification, API comparison and build instruction. Second, the ultrasound imaging algorithms’ computation characteristics will be analyzed, and we will show how to optimize the application on Intel GPUs, Including ESIDM usage. Third, the early experiences of tuning the migrated code to target FPGA will be highlighted, this will include device code rewrite for FPGA and programming skills to improve performance on FPGA. The device code comparison of GPU and FPGA will also be discussed. Last, we will compare ultrasound imaging algorithms performance and computation results on different hardware, including Intel GPU (integrated GPU and discrete GPU), Intel Arria 10 FPGA, Intel CPU, Nvidia GTX 1080 GPU, and GTX 960M GPU.
{"title":"Developing medical ultrasound imaging application across GPU, FPGA, and CPU using oneAPI","authors":"Yong Wang, Yongfa Zhou, Q. Wang, Wang Yang, Qing Xu, Chen Wang","doi":"10.1145/3456669.3456680","DOIUrl":"https://doi.org/10.1145/3456669.3456680","url":null,"abstract":"The Diagnostic ultrasound is a rapidly developing imaging technology that is widely used in the clinic. A typical ultrasound imaging pipeline including the following algorithms: beamforming, Envelope detection, log-compression, and scan-conversion [1]. In tradition, ultrasound imaging is implemented using Application-specific integrated circuits (ASICs) and FPGAs due to its high throughput and massive data processing requirements. With the development of the GPGPU and its programming environments (e.g. CUDA), researchers use software to implement ultrasound imaging algorithms [2], [3]. For now, the two limiting factors of developing ultrasound imaging are: First, using a hardware development approach to implement ultrasound imaging algorithms is complex, time-consuming and lacks flexibility. Second, the existing CUDA-based ultrasound imaging implementations are limited to Nvidia hardware, which is also a restriction applying more architectures. oneAPI is a cross-platform and unified programming environment developed by intel. It enables heterogeneous computing across multiple hardware architectures using Data Parallel C++ (DPC++). This new programming suite can be used to address the problems mentioned above. To be clear, using a high-level language like DPC++ to program FPGA can accelerate ultrasound imaging application development. SYCL-based ultrasound imaging applications can be easily migrated to other vendor's hardware. To implement an ultrasound imaging application across multiple architectures (e.g., GPU, FPGA, and CPU) in a unified programming environment. We migrated a CUDA-based open-source ultrasound imaging project SUPRA [4]. The migration process was performed using oneAPI compatibility tool (e.g. dpct). After migration, the code was tuned to run on GPU, FPGA, and CPU. In this talk, we will discuss our experiences with the complete process of migrating a CUDA code to oneAPI code. First, the whole process of migrating CUDA code base using the dpct will be presented, including usage, code modification, API comparison and build instruction. Second, the ultrasound imaging algorithms’ computation characteristics will be analyzed, and we will show how to optimize the application on Intel GPUs, Including ESIDM usage. Third, the early experiences of tuning the migrated code to target FPGA will be highlighted, this will include device code rewrite for FPGA and programming skills to improve performance on FPGA. The device code comparison of GPU and FPGA will also be discussed. Last, we will compare ultrasound imaging algorithms performance and computation results on different hardware, including Intel GPU (integrated GPU and discrete GPU), Intel Arria 10 FPGA, Intel CPU, Nvidia GTX 1080 GPU, and GTX 960M GPU.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88930572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rod Burns, S. Larsen, B. Cook, D. Doerfler, Kevin G. Harms, T. Applencourt, Stuart Adams
{"title":"Bringing SYCL to Ampere architecture","authors":"Rod Burns, S. Larsen, B. Cook, D. Doerfler, Kevin G. Harms, T. Applencourt, Stuart Adams","doi":"10.1145/3456669.3456685","DOIUrl":"https://doi.org/10.1145/3456669.3456685","url":null,"abstract":"","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87605619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}