
Proceedings of the 23rd International Workshop on Software and Compilers for Embedded Systems: Latest Publications

OpenMP to CUDA graphs: a compiler-based transformation to enhance the programmability of NVIDIA devices
Chen Yu, Sara Royuela, E. Quiñones
Heterogeneous computing is increasingly used in a diversity of computing systems, ranging from HPC to the real-time embedded domain, to cope with their performance requirements. Given the variety of accelerators (e.g., FPGAs and GPUs), high-level parallel programming models are desirable for exploiting their performance capabilities while maintaining an adequate level of productivity. In that regard, OpenMP is a well-known high-level programming model that incorporates powerful task and accelerator models capable of efficiently exploiting structured and unstructured parallelism in heterogeneous computing. This paper presents a novel compiler transformation technique that automatically transforms OpenMP code into CUDA graphs, combining the programmability benefits of a high-level programming model such as OpenMP with the performance benefits of a low-level programming model such as CUDA. Evaluations have been performed on two NVIDIA GPUs from the HPC and embedded domains: the V100 and the Jetson AGX, respectively.
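As background for how such a transformation is realized, the following is a minimal host-side sketch of the CUDA Graphs API that generated code of this kind would target (a sketch only, not the paper's generated code); the async memset/memcpy nodes stand in for the kernels that OpenMP tasks would map to:

```cpp
// Minimal sketch of the CUDA Graphs API (not the paper's generated code).
// The async memset/memcpy stand in for kernels that OpenMP tasks map to.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Record the task DAG once via stream capture...
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    cudaMemsetAsync(d_a, 0, n * sizeof(float), s);    // "task" 1
    cudaMemcpyAsync(d_b, d_a, n * sizeof(float),
                    cudaMemcpyDeviceToDevice, s);     // "task" 2, depends on 1
    cudaStreamEndCapture(s, &graph);

    // ...then instantiate and replay it with minimal launch overhead.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    for (int i = 0; i < 100; ++i)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d_a);
    cudaFree(d_b);
    printf("replayed captured graph 100 times\n");
    return 0;
}
```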
Citations: 8
Programming tensor cores from an image processing DSL
Savvas Sioutas, S. Stuijk, T. Basten, L. Somers, H. Corporaal
Tensor Cores (TCUs) are specialized units first introduced by NVIDIA in the Volta microarchitecture to accelerate matrix multiplications for deep learning and linear algebra workloads. While these units have proved capable of providing significant speedups for specific applications, their programmability remains difficult for the average user. In this paper, we extend the Halide DSL and compiler with the ability to utilize these units when generating code for a CUDA-based NVIDIA GPGPU. To this end, we introduce a new scheduling directive along with custom lowering passes that automatically transform a Halide AST in order to generate code for the TCUs. We evaluate the generated code and show that it can achieve over 5X speedup compared to Halide manual schedules without TCU support, while remaining within 20% of the NVIDIA cuBLAS implementations for mixed-precision GEMM and within 10% of manual CUDA implementations with WMMA intrinsics.
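For context, the sketch below shows the kind of mixed-precision GEMM pipeline such a directive applies to, written against the standard Halide C++ API; the `tensor_core()` directive named in the comment is a hypothetical placeholder, since the paper's actual directive name is not given here:

```cpp
// Sketch of a mixed-precision GEMM in Halide (half inputs, float
// accumulate). Standard Halide API only; the comment marks where the
// paper's new scheduling directive (name assumed) would request
// WMMA/TCU lowering instead of plain CUDA code generation.
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam A(Float(16), 2, "A"), B(Float(16), 2, "B");
    const int K = 1024;

    Var x("x"), y("y"), xi("xi"), yi("yi");
    RDom k(0, K);

    Func C("C");
    C(x, y) = 0.f;
    C(x, y) += cast<float>(A(k, y)) * cast<float>(B(x, k));

    Target t = get_host_target().with_feature(Target::CUDA);
    C.gpu_tile(x, y, xi, yi, 16, 16);
    C.update().gpu_tile(x, y, xi, yi, 16, 16);
    // Hypothetical: the paper's directive would be applied to the update
    // definition here, e.g. C.update().tensor_core(...), triggering the
    // custom lowering passes that emit WMMA fragments.

    C.compile_jit(t);
    return 0;
}
```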
Citations: 5
Configuring loosely time-triggered wireless control software
Philipp H. Kindt, Sumana Ghosh, S. Chakraborty
In many wireless control networks, sensor data and controller data are exchanged periodically, which requires periodic packet transmissions between the physical plant and the controller. As an alternative, event-triggered control paradigms imply that data is only exchanged when there are significant changes in the state of the plant, e.g., because of disturbances. This is the nature of many IoT scenarios and requires a receiving device to listen to the channel for incoming packets at all times. However, especially in mobile networks in which all devices are battery-powered, continuous scanning would quickly drain the battery; hence, reception needs to be duty-cycled. When optimizing such duty-cycled operation, significant energy savings are possible using intelligent software-enabled communication scheduling. In this paper, we propose a wireless transmission scheme that supports loosely time-triggered control. By optimizing the scheduling of transmission and reception windows in the communication protocol, our proposed scheme allows for energy-efficient communication without requiring strict clock synchronization between the devices. We show that such a scheme is practical and can greatly reduce the energy consumption in event-triggered control applications.
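To illustrate the underlying idea, here is a small, self-contained model (not the paper's protocol) of duty-cycled reception with drift-dependent guard intervals, which is what makes loose time-triggering work without strict clock synchronization:

```cpp
// Illustrative model (not the paper's protocol) of loosely time-triggered,
// duty-cycled reception: the receiver only listens in a window around each
// expected packet, widening the window with the worst-case clock drift
// accumulated since the last successfully received packet.
#include <cstdio>

struct ListenWindow { double start_ms, end_ms; };

// period_ms: nominal transmission period; rho: worst-case relative clock
// drift (e.g. 40 ppm -> 40e-6); last_sync_ms: time of the last reception.
// If a packet is missed, the caller keeps last_sync_ms fixed, so the guard
// interval of the following window grows automatically.
ListenWindow nextWindow(double last_sync_ms, double expected_ms, double rho,
                        double radio_on_ms) {
    double guard = rho * (expected_ms - last_sync_ms);  // drift margin
    return { expected_ms - guard, expected_ms + guard + radio_on_ms };
}

int main() {
    const double period = 1000.0, rho = 40e-6, on = 2.0;
    double last_sync = 0.0;
    for (int i = 1; i <= 3; ++i) {
        ListenWindow w = nextWindow(last_sync, last_sync + period, rho, on);
        printf("window %d: [%.3f, %.3f] ms, duty cycle %.4f%%\n",
               i, w.start_ms, w.end_ms,
               100.0 * (w.end_ms - w.start_ms) / period);
        last_sync = w.start_ms + (w.end_ms - w.start_ms) / 2;  // packet received
    }
    return 0;
}
```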
Citations: 1
On the implementation and execution of adaptive streaming applications modeled as MADF
Sobhan Niknam, Peng Wang, T. Stefanov
It has been shown that mode-aware dataflow (MADF) is an advantageous analysis model for adaptive streaming applications. However, no attention has been paid to how an application modeled and analyzed with MADF can be implemented and executed on a Multi-Processor System-on-Chip such that the properties of the analysis model are preserved. Therefore, in this paper, we consider this matter and propose a generic parallel implementation and execution approach for adaptive streaming applications modeled with MADF. Our approach can be easily realized on top of existing operating systems while supporting the utilization of a wider range of schedules. In particular, we demonstrate our approach on LITMUS^RT, one of the existing real-time extensions of the Linux kernel. Finally, to show the practical applicability of our approach and its conformity to the analysis model, we present a case study using a real-life adaptive streaming application.
Citations: 0
Cross-layer approaches for improving the dependability of deep learning systems
Muhammad Abdullah Hanif, L. Hoang, M. Shafique
Deep Neural Networks (DNNs), the state-of-the-art computational models for many Artificial Intelligence (AI) applications, are inherently compute- and resource-intensive and, hence, cannot exploit traditional redundancy-based fault-mitigation techniques for enhancing the dependability of DNN-based systems. Therefore, there is a dire need for alternative methods that exploit the intrinsic characteristics of these networks to improve their reliability without a high expenditure of resources. In this paper, we present cross-layer approaches that, based on the intrinsic characteristics of DNNs, employ software- and hardware-level modifications to improve the resilience of DNN-based systems to hardware-level faults, e.g., soft errors and permanent faults.
Citations: 1
Scheduling of moldable fork-join tasks with inter- and intra-task communications
Hiroki Nishikawa, Kaname Shimada, Ittetsu Taniguchi, H. Tomiyama
This paper proposes scheduling techniques for moldable fork-join tasks on multicore architectures. The proposed techniques decide the number of cores and the execution start time for each task during scheduling and mapping, taking into account inter- and intra-task communications. The techniques are based on an integer programming formulation and aim at minimizing the overall schedule length. Experimental results are compared with state-of-the-art techniques.
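For readers unfamiliar with this style of model, a generic makespan-minimization ILP for moldable tasks looks roughly as follows (a sketch of the formulation style, not the paper's exact model):

```latex
% Generic moldable-task makespan ILP (a sketch, not the paper's exact
% model). x_{i,p} selects the number of cores p for task i, w_i(p) is its
% execution time on p cores, s_i its start time, c_{ij} the communication
% delay on edge (i,j), and P the number of cores.
\begin{align*}
\min \quad & C_{\max} \\
\text{s.t.}\quad
  & \textstyle\sum_{p=1}^{P} x_{i,p} = 1 && \forall i
    && \text{(pick one core count)} \\
  & s_i + \textstyle\sum_{p} x_{i,p}\, w_i(p) \le C_{\max} && \forall i \\
  & s_j \ge s_i + \textstyle\sum_{p} x_{i,p}\, w_i(p) + c_{ij}
    && \forall (i,j) \in E
    && \text{(precedence + communication)} \\
  & \textstyle\sum_{i\,\text{active at}\,t} \sum_{p} p\, x_{i,p} \le P
    && \forall t
    && \text{(core capacity)} \\
  & x_{i,p} \in \{0,1\}, \quad s_i \ge 0
\end{align*}
% The capacity constraint is stated informally; in practice it is
% linearized with binary overlap indicators between task pairs.
```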
Citations: 1
A secure hardware-software solution based on RISC-V, logic locking and microkernel
Dominik Sisejkovic, Farhad Merchant, Lennart M. Reimann, R. Leupers, M. Giacometti, Sascha Kegreiss
In this paper we present the first generation of a secure platform developed by following a security-by-design approach. The security of the platform is built on two pillars: a secured hardware design flow and a secure microkernel. The hardware design is protected against the insertion of hardware Trojans during the production phase through netlist obfuscation provided by logic locking. The software stack is based on a trustworthy, verified microkernel. Moreover, the system is expected to work in an environment that does not allow physical access to the device; therefore, attacks in the field are only possible via software. We present a solution whose security is achieved by relying on simple and open hardware and software components, namely a RISC-V processor core, open-source peripherals, and an seL4-based operating system.
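As a conceptual illustration of logic locking (independent of the paper's actual design flow), the following sketch shows XOR key gates inserted into a small combinational function, so the circuit computes the intended function only under the correct key:

```cpp
// Conceptual illustration of logic locking (not the paper's netlist flow):
// XOR key gates are inserted into a combinational function so that the
// circuit only computes the intended function for the correct key.
#include <cstdio>

// Original function: y = (a & b) | c
// Locked variant: two XOR key gates with key bits k0, k1. With the
// correct key (k0 = k1 = 0) each XOR is an identity and the original
// function is recovered; any other key corrupts the truth table.
static bool locked(bool a, bool b, bool c, bool k0, bool k1) {
    bool t = (a && b) ^ k0;   // key gate 1 obfuscates the AND output
    return (t || c) ^ k1;     // key gate 2 obfuscates the final output
}

int main() {
    const bool K0 = false, K1 = false;  // correct key
    for (int v = 0; v < 8; ++v) {
        bool a = v & 1, b = v & 2, c = v & 4;
        printf("a=%d b=%d c=%d  correct-key=%d wrong-key=%d\n",
               (int)a, (int)b, (int)c,
               (int)locked(a, b, c, K0, K1),
               (int)locked(a, b, c, true, false));  // wrong key flips outputs
    }
    return 0;
}
```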
Citations: 12
Reviewing inference performance of state-of-the-art deep learning frameworks
Berk Ulker, S. Stuijk, H. Corporaal, R. Wijnhoven
Deep learning models have replaced conventional methods for machine learning tasks. Efficient inference on edge devices with limited resources is key for broader deployment. In this work, we focus on the tool selection challenge for inference deployment. We present an extensive evaluation of the inference performance of deep learning software tools using state-of-the-art CNN architectures for multiple hardware platforms. We benchmark these hardware-software pairs for a broad range of network architectures, inference batch sizes, and floating-point precision, focusing on latency and throughput. Our results reveal interesting combinations for optimal tool selection, resulting in different optima when considering minimum latency and maximum throughput.
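A minimal harness for the two reported metrics might look as follows; `infer` is a hypothetical placeholder for an actual framework invocation:

```cpp
// Sketch of the two metrics the survey benchmarks, for any inference
// callable. infer(batch) is a placeholder for a framework call (e.g. a
// TensorRT or TFLite invocation); all names here are hypothetical.
#include <chrono>
#include <cstdio>
#include <functional>

struct BenchResult { double latency_ms; double throughput_sps; };

BenchResult benchmark(const std::function<void(int)>& infer,
                      int batch, int warmup, int iters) {
    for (int i = 0; i < warmup; ++i) infer(batch);  // exclude warm-up effects

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) infer(batch);
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return { ms / iters,                        // latency per batch
             1000.0 * iters * batch / ms };     // samples per second
}

int main() {
    auto fake_infer = [](int batch) {           // stand-in for a real model
        volatile double acc = 0;
        for (int i = 0; i < batch * 100000; ++i) acc += i;
    };
    for (int batch : {1, 8, 32}) {
        BenchResult r = benchmark(fake_infer, batch, 3, 20);
        printf("batch %2d: latency %.2f ms, throughput %.1f samples/s\n",
               batch, r.latency_ms, r.throughput_sps);
    }
    return 0;
}
```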
Citations: 13
Real-time audio processing for hearing aids using a model-based Bayesian inference framework
M. Roa-Villescas, B. Vries, S. Stuijk, H. Corporaal
Development of hearing aid (HA) signal processing algorithms entails an iterative process between two design steps, namely algorithm development and the embedded implementation. Algorithm designers favor high-level programming languages for several reasons, including higher productivity, code readability and, perhaps most importantly, the availability of state-of-the-art signal processing frameworks that open new research directions. Embedded software, on the other hand, is preferably implemented in a low-level programming language to allow finer control of the hardware, an essential trait in real-time processing applications. In this paper we present a technique that allows deploying DSP algorithms written in Julia, a modern high-level programming language, on a real-time HA processing platform known as openMHA. We demonstrate this technique by using a model-based Bayesian inference framework to perform real-time audio processing.
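To give a flavor of such a deployment, here is a minimal sketch of embedding Julia in a C++ audio-processing callback through Julia's C embedding API; the openMHA plugin boilerplate is omitted, and `process_block` is a hypothetical stand-in for the plugin's per-block processing hook (a real deployment would pass whole buffers to Julia rather than boxing individual samples):

```cpp
// Sketch of embedding Julia in a C++ audio callback via Julia's C
// embedding API (julia.h). GC rooting (JL_GC_PUSH) and openMHA plugin
// boilerplate are omitted for brevity; process_block is hypothetical.
#include <julia.h>
#include <cstdio>

static jl_function_t* g_process = nullptr;

void init_dsp() {
    jl_init();
    // Define (or include) the Julia DSP routine once at startup.
    jl_eval_string("process_sample(x) = 0.5 * x  # placeholder gain stage");
    g_process = jl_get_function(jl_main_module, "process_sample");
}

void process_block(float* buf, int n) {
    for (int i = 0; i < n; ++i) {
        jl_value_t* y = jl_call1(g_process, jl_box_float64(buf[i]));
        buf[i] = (float)jl_unbox_float64(y);
    }
}

int main() {
    init_dsp();
    float block[4] = {1.f, -1.f, 0.5f, 0.25f};
    process_block(block, 4);
    printf("%f %f %f %f\n", block[0], block[1], block[2], block[3]);
    jl_atexit_hook(0);
    return 0;
}
```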
Citations: 3
Exploration of GPU sharing policies under GEMM workloads
Ioannis Oroutzoglou, Dimosthenis Masouros, Konstantina Koliogeorgi, S. Xydis, D. Soudris
Lately, cloud computing has seen explosive growth due to the flexibility and scalability it offers. Ever-increasing computational demands, especially from the machine learning domain, have forced cloud operators to enhance their infrastructure with acceleration devices such as General-Purpose (GP)GPUs or FPGAs. Even though multi-tenancy has been widely examined for conventional CPUs, this is not the case for accelerators. Current solutions support "one accelerator per user" schemes, which can lead to both under-utilization and starvation of available resources. In this work, we analyze the potential of GPU sharing inside data-center environments. We investigate how several architectural features affect the performance of GPUs under different multi-tenant stressing scenarios. We compare CUDA MPS with the native, default CUDA scheduler, as well as with Vinetalk, a research framework providing GPU sharing capabilities. Experimental results show that NVIDIA's MPS achieves the best performance in multi-application scenarios, with speedups of up to 4.5X over the native CUDA scheduler and up to 11.2X over Vinetalk.
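As an illustration of the workload class, the sketch below issues several independent SGEMMs on separate CUDA streams to emulate co-located tenants; note that in the multi-tenant setting each tenant would be a separate process (the case CUDA MPS, started externally with `nvidia-cuda-mps-control -d`, actually mediates), so this single-process version only approximates the setup:

```cpp
// Sketch of a GEMM co-location workload: several "tenants" issue
// independent SGEMMs on their own CUDA streams. How they interleave on
// the device depends on the sharing policy in effect (default scheduler
// vs. CUDA MPS for separate processes). Error handling omitted.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 2048, tenants = 4;
    const float alpha = 1.f, beta = 0.f;

    float *A, *B, *C;
    cudaMalloc(&A, (size_t)n * n * sizeof(float));
    cudaMalloc(&B, (size_t)n * n * sizeof(float));
    cudaMalloc(&C, (size_t)tenants * n * n * sizeof(float));

    cublasHandle_t handle[tenants];
    cudaStream_t stream[tenants];
    for (int t = 0; t < tenants; ++t) {
        cudaStreamCreate(&stream[t]);
        cublasCreate(&handle[t]);
        cublasSetStream(handle[t], stream[t]);   // one stream per tenant
    }

    // Launch the GEMMs concurrently; contention is resolved by the policy.
    for (int t = 0; t < tenants; ++t)
        cublasSgemm(handle[t], CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C + (size_t)t * n * n, n);
    cudaDeviceSynchronize();

    for (int t = 0; t < tenants; ++t) {
        cublasDestroy(handle[t]);
        cudaStreamDestroy(stream[t]);
    }
    cudaFree(A); cudaFree(B); cudaFree(C);
    printf("launched %d concurrent SGEMMs\n", tenants);
    return 0;
}
```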
Citations: 1