OpenMP to CUDA graphs: a compiler-based transformation to enhance the programmability of NVIDIA devices
Chen Yu, Sara Royuela, E. Quiñones
DOI: 10.1145/3378678.3391881

Heterogeneous computing is increasingly used in a wide range of computing systems, from HPC to the real-time embedded domain, to cope with growing performance requirements. Given the variety of accelerators, e.g., FPGAs and GPUs, high-level parallel programming models are desirable to exploit their performance capabilities while maintaining an adequate level of productivity. In that regard, OpenMP is a well-known high-level programming model that incorporates powerful task and accelerator models capable of efficiently exploiting structured and unstructured parallelism in heterogeneous computing. This paper presents a novel compiler transformation technique that automatically transforms OpenMP code into CUDA graphs, combining the programmability benefits of a high-level programming model such as OpenMP with the performance benefits of a low-level programming model such as CUDA. Evaluations have been performed on two NVIDIA GPUs from the HPC and embedded domains, i.e., the V100 and the Jetson AGX, respectively.
Programming tensor cores from an image processing DSL
Savvas Sioutas, S. Stuijk, T. Basten, L. Somers, H. Corporaal
DOI: 10.1145/3378678.3391880

Tensor Cores (TCUs) are specialized units first introduced by NVIDIA in the Volta microarchitecture to accelerate matrix multiplications for deep learning and linear algebra workloads. While these units have proved capable of providing significant speedups for specific applications, they remain difficult to program for the average user. In this paper, we extend the Halide DSL and compiler with the ability to utilize these units when generating code for a CUDA-based NVIDIA GPGPU. To this end, we introduce a new scheduling directive along with custom lowering passes that automatically transform a Halide AST so that code can be generated for the TCUs. We evaluate the generated code and show that it can achieve over 5X speedup compared to Halide manual schedules without TCU support, while remaining within 20% of the NVIDIA cuBLAS implementations for mixed-precision GEMM and within 10% of manual CUDA implementations with WMMA intrinsics.
Configuring loosely time-triggered wireless control software
Philipp H. Kindt, Sumana Ghosh, S. Chakraborty
DOI: 10.1145/3378678.3391888

In many wireless control networks, sensor data and controller data are exchanged periodically, which requires periodic packet transmissions between the physical plant and the controller. As an alternative, event-triggered control paradigms exchange data only when there are significant changes in the state of the plant, e.g., because of disturbances. This is the nature of many IoT scenarios and requires a receiving device to listen to the channel for incoming packets at all times. However, especially in mobile networks, in which all devices are battery-powered, continuous scanning would drain the battery quickly; hence, reception needs to be duty-cycled. When optimizing such duty-cycled operation, significant energy savings are possible using intelligent software-enabled communication scheduling. In this paper, we propose a wireless transmission scheme that supports loosely time-triggered control. By optimizing the scheduling of transmissions and reception windows in the communication protocol, our proposed scheme allows for energy-efficient communication without requiring strict clock synchronization between the devices. We show that such a scheme is practical and can greatly reduce the energy consumption of event-triggered control applications.
On the implementation and execution of adaptive streaming applications modeled as MADF
Sobhan Niknam, Peng Wang, T. Stefanov
DOI: 10.1145/3378678.3391876

It has been shown that the mode-aware dataflow (MADF) model is an advantageous analysis model for adaptive streaming applications. However, no attention has been paid to how an application modeled and analyzed with MADF can be implemented and executed on a Multi-Processor System-on-Chip such that the properties of the analysis model are preserved. In this paper, we therefore consider this matter and propose a generic parallel implementation and execution approach for adaptive streaming applications modeled with MADF. Our approach can be easily realized on top of existing operating systems while supporting a wide range of schedules. In particular, we demonstrate our approach on LITMUS^RT, one of the existing real-time extensions of the Linux kernel. Finally, to show the practical applicability of our approach and its conformity to the analysis model, we present a case study using a real-life adaptive streaming application.
Cross-layer approaches for improving the dependability of deep learning systems
Muhammad Abdullah Hanif, L. Hoang, M. Shafique
DOI: 10.1145/3378678.3391884

Deep Neural Networks (DNNs), the state-of-the-art computational models for many Artificial Intelligence (AI) applications, are inherently compute- and resource-intensive and, hence, cannot exploit traditional redundancy-based fault mitigation techniques for enhancing the dependability of DNN-based systems. There is therefore a dire need for alternative methods that improve reliability without a high expenditure of resources by exploiting the intrinsic characteristics of these networks. In this paper, we present cross-layer approaches that, based on the intrinsic characteristics of DNNs, employ software- and hardware-level modifications to improve the resilience of DNN-based systems to hardware-level faults, e.g., soft errors and permanent faults.
Scheduling of moldable fork-join tasks with inter- and intra-task communications
Hiroki Nishikawa, Kaname Shimada, Ittetsu Taniguchi, H. Tomiyama
DOI: 10.1145/3378678.3391875

This paper proposes scheduling techniques for moldable fork-join tasks on multicore architectures. The proposed techniques decide the number of cores and the execution start time for each task during scheduling and mapping, taking into account inter- and intra-task communications. Based on an integer programming formulation, the proposed techniques aim at minimizing the overall schedule length. Experimental results are compared with state-of-the-art techniques.
A secure hardware-software solution based on RISC-V, logic locking and microkernel
Dominik Sisejkovic, Farhad Merchant, Lennart M. Reimann, R. Leupers, M. Giacometti, Sascha Kegreiss
DOI: 10.1145/3378678.3391886

In this paper, we present the first generation of a secure platform developed by following a security-by-design approach. The security of the platform rests on two pillars: a secured hardware design flow and a secure microkernel. The hardware design is protected against the insertion of hardware Trojans during the production phase through netlist obfuscation provided by logic locking. The software stack is based on a trustworthy and verified microkernel. Moreover, the system is expected to operate in an environment that does not allow physical access to the device, so in-the-field attacks are only possible via software. We present a solution whose security relies on simple and open hardware and software components, namely a RISC-V processor core, open-source peripherals, and an seL4-based operating system.
Reviewing inference performance of state-of-the-art deep learning frameworks
Berk Ulker, S. Stuijk, H. Corporaal, R. Wijnhoven
DOI: 10.1145/3378678.3391882

Deep learning models have replaced conventional methods for many machine learning tasks. Efficient inference on edge devices with limited resources is key for broader deployment. In this work, we focus on the tool-selection challenge for inference deployment. We present an extensive evaluation of the inference performance of deep learning software tools using state-of-the-art CNN architectures on multiple hardware platforms. We benchmark these hardware-software pairs for a broad range of network architectures, inference batch sizes, and floating-point precisions, focusing on latency and throughput. Our results reveal interesting combinations for optimal tool selection, with different optima when considering minimum latency versus maximum throughput.
Real-time audio processing for hearing aids using a model-based Bayesian inference framework
M. Roa-Villescas, B. Vries, S. Stuijk, H. Corporaal
DOI: 10.1145/3378678.3397528

Development of hearing aid (HA) signal processing algorithms entails an iterative process between two design steps, namely algorithm development and embedded implementation. Algorithm designers favor high-level programming languages for several reasons, including higher productivity, code readability and, perhaps most importantly, the availability of state-of-the-art signal processing frameworks that open new research directions. Embedded software, on the other hand, is preferably implemented in a low-level programming language to allow finer control of the hardware, an essential trait in real-time processing applications. In this paper we present a technique that allows DSP algorithms written in Julia, a modern high-level programming language, to be deployed on a real-time HA processing platform known as openMHA. We demonstrate this technique by using a model-based Bayesian inference framework to perform real-time audio processing.
Exploration of GPU sharing policies under GEMM workloads
Ioannis Oroutzoglou, Dimosthenis Masouros, Konstantina Koliogeorgi, S. Xydis, D. Soudris
DOI: 10.1145/3378678.3391887

Lately, cloud computing has seen explosive growth due to the flexibility and scalability it offers. Ever-increasing computational demands, especially from the machine learning domain, have forced cloud operators to enhance their infrastructure with acceleration devices such as General-Purpose GPUs (GPGPUs) or FPGAs. Even though multi-tenancy has been widely examined for conventional CPUs, this is not the case for accelerators. Current solutions support "one accelerator per user" schemes, which can lead to both under-utilization and starvation of the available resources. In this work, we analyze the potential of GPU sharing inside data-center environments. We investigate how several architectural features affect the performance of GPUs under different multi-tenant stressing scenarios. We compare CUDA MPS with the native, default CUDA scheduler and also with Vinetalk, a research framework providing GPU sharing capabilities. Experimental results show that NVIDIA's MPS achieves the best performance in multi-application scenarios, specifically up to 4.5X and 11.2X compared to the native CUDA scheduler and Vinetalk, respectively.