
Latest publications: 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

PSU: A Framework for Dynamic Software Updates in Multi-threaded C-Language Programs
Marcus Karpoff, J. N. Amaral, Kai-Ting Amy Wang, Rayson Ho, B. Dobry
A Dynamic Software Update (DSU) system enables an operator to modify a running program without interrupting its execution. However, creating a DSU system that allows programs written in the C programming language to be modified while they are executing is challenging. This paper presents the Portable Software Update (PSU) system, a new framework for creating C-language DSU programs. PSU offers a simple programming interface to build DSU versions of existing C programs. Once a program is built using PSU, updates can be applied by background threads with negligible impact on the execution of the program. PSU supports multi-threaded and recursive programs without the use of safe points or thread blocking. PSU uses function indirection to redirect DSU function calls to the newest version of the function code. Once a DSU function is invoked in a PSU program, it executes to completion using the version of the function that was active when it was invoked. However, if a new version is installed, any future calls to the same function execute the newest version. This simple mechanism allows for quick loading of updates in PSU. PSU unloads obsolete versions of DSU functions once they are no longer executing, making PSU the first DSU system for C-language programs able to unload older versions of code. This efficient use of resources enables many patches to be applied to a long-running application. A suite of custom synthetic programs, and a DSU-enabled version of the MySQL database storage engine, are used to evaluate the overhead of the DSU-enabling features. The MySQL storage engine maintains over 95% of the performance of the non-DSU version and allows the entire storage engine to be updated while the database continues executing. PSU includes a simple and straightforward process for modifying the storage engine to enable DSU.
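The function-indirection mechanism described in the abstract can be illustrated with a minimal sketch: each updatable function is reached through an atomic pointer that a background updater thread can swap, so in-flight calls finish with the version they started with while future calls pick up the new one. All names below (dsu_entry, greet_v1, apply_update) are invented for illustration; the abstract does not describe PSU's actual API.

```cpp
#include <atomic>
#include <cstdio>

// Hypothetical sketch of function indirection for dynamic updates.
// Names are invented; PSU's real interface is not given in the abstract.
static void greet_v1() { std::puts("hello from v1"); }
static void greet_v2() { std::puts("hello from v2"); }

// Every call site reaches the DSU function through this atomic cell.
static std::atomic<void (*)()> dsu_entry{&greet_v1};

void call_greet() {
    // Load once: the call completes with the version that was active at
    // invocation time, even if an update is installed concurrently.
    void (*fn)() = dsu_entry.load(std::memory_order_acquire);
    fn();
}

void apply_update() {
    // A background updater installs the new version; future calls take it,
    // in-flight calls are unaffected.
    dsu_entry.store(&greet_v2, std::memory_order_release);
}

int main() {
    call_greet();   // prints "hello from v1"
    apply_update();
    call_greet();   // prints "hello from v2"
}
```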
{"title":"PSU: A Framework for Dynamic Software Updates in Multi-threaded C-Language Programs","authors":"Marcus Karpoff, J. N. Amaral, Kai-Ting Amy Wang, Rayson Ho, B. Dobry","doi":"10.1109/SBAC-PAD49847.2020.00040","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00040","url":null,"abstract":"A Dynamic Software Update (DSU) system enables an operator to modify a running program without interrupting its execution. However, creating a DSU system to allow programs written in the C programming language to be modified while they are executing is challenging. This paper presents the Portable Software Update (PSU) system, a new framework that allows the creation of C-language DSU programs. PSU offers a simple programming interface to build DSU versions of existing C programs. Once a program is built using PSU, updates can be applied by background threads that have negligible impact on the execution of the program. PSU supports multi-threaded and recursive programs without the use of safe points or thread blocking. PSU uses function indirection to redirect DSU functions calls to the newest version of the function code. Once a DSU function is invoked in a PSU program, it executes to completion using the version of the function that was active when it was invoked. However, if a new version is installed, any future calls to the same function always execute the newest version. This simple mechanism allows for quick loading of updates in PSU. PSU unloads obsolete version of DSU functions after they are no longer executing. This mechanism makes PSU the first DSU system for C-language programs that is able to unload older versions of code. This efficient use of resources enables many patches to be applied to a long-running application. A suite of specialized custom synthetic programs, and a DSU-enabled version of the MySQL database storage engine, are used to evaluate the overhead of the DSU-enabling features. The MySQL storage engine maintains over 95% of the performance of the non-DSU version and allows the entire storage engine to be updated while the database continues executing. PSU includes a simple and straightforward process for the modification of the storage engine that enables DSU.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121244916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
FFT Optimizations and Performance Assessment Targeted towards Satellite and Airborne Radar Processing
Maron Schlemon, J. Naghmouchi
Following the re-invention of the FFT algorithm by Cooley and Tukey in 1965, a lot of effort has been invested in optimizing this algorithm and all its variations. In this paper, we discuss its use and optimization for current and future radar applications, and give a brief survey of implementations that have claimed relatively high performance advantages over existing solutions. Correspondingly, we present an in-depth analysis of state-of-the-art solutions and of our own implementation that allows the reader to evaluate the performance improvements on a fair basis. We then describe the development of a high-performance Fast Fourier Transform (FFT) using an enhanced Radix-4 decimation-in-frequency (DIF) algorithm and compare it against the Fastest Fourier Transform in the West (FFTW) autotuned library as well as other solutions and frameworks.
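For context on the comparison baseline, the following minimal sketch shows how an FFTW baseline is typically set up with its autotuning planner (FFTW_MEASURE); the authors' enhanced Radix-4 DIF kernel itself is not reproduced here, and the transform size is an arbitrary choice.

```cpp
#include <fftw3.h>  // link with -lfftw3

// Minimal FFTW baseline with autotuned planning, as commonly used when
// benchmarking a custom FFT against FFTW.
int main() {
    const int n = 4096;  // arbitrary power-of-four size, convenient for radix-4
    fftw_complex *in  = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * n);
    // FFTW_MEASURE times candidate plans (autotuning); it may overwrite the
    // arrays, so initialize the input after planning.
    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);
    for (int i = 0; i < n; ++i) { in[i][0] = i % 7; in[i][1] = 0.0; }
    fftw_execute(plan);  // the region to time when benchmarking
    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
}
```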
{"title":"FFT Optimizations and Performance Assessment Targeted towards Satellite and Airborne Radar Processing","authors":"Maron Schlemon, J. Naghmouchi","doi":"10.1109/SBAC-PAD49847.2020.00050","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00050","url":null,"abstract":"Following the re-invention of the FFT algorithm by Cooley and Tukey in 1965, a lot of effort has been invested into optimization of this algorithm and all its variations. In this paper, we discuss its use and optimization for current and future radar applications, and give a brief survey on implementations that have claimed relatively high advantages in terms of performance over existing solutions. Correspondingly, we present an in-depth analysis of state-ofthe-art solutions and our own implementation that will allow the reader to evaluate the performance improvements on a fair basis. Therefore, we discuss the development of a highperformance Fast Fourier Transform (FFT) using an enhanced Radix-4 decimation in frequency (DIF) algorithm, compare it against the Fastest Fourier Transform in the West (FFTW) autotuned library as well as other solutions and frameworks.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126735839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Predicting the Energy Consumption of CUDA Kernels using SimGrid
Dorra Boughzala, L. Lefèvre, Anne-Cécile Orgerie
Building a sustainable Exascale machine is a very promising target in High Performance Computing (HPC). To tackle the energy consumption challenge while continuing to provide tremendous performance, the HPC community has rapidly adopted GPU-based systems. Today, GPUs have become the most prevalent components in the massively parallel HPC landscape thanks to their high computational power and energy efficiency. Modeling the energy consumption of applications running on GPUs has gained a lot of attention in recent years. Alas, the HPC community lacks simple yet accurate simulators to predict the energy consumption of general-purpose GPU applications. In this work, we address the prediction of the energy consumption of CUDA kernels via simulation. We propose a simple and lightweight energy model, which we implemented using the open-source framework SimGrid. Our proposed model is validated across a diverse set of CUDA kernels and on two different NVIDIA GPUs (Tesla M2075 and Kepler K20Xm). As our modeling approach is not based on performance counters or detailed architectural parameters, we believe that our model can be easily adopted by users who care about the energy consumption of their GPGPU applications.
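The abstract does not state the model's equations. One plausible form of a "simple and lightweight" model that needs no performance counters is a linear static-plus-dynamic power term integrated over the simulated kernel time; the sketch below illustrates that idea with invented coefficients and is not the paper's calibrated model.

```cpp
#include <cstdio>

// Illustrative linear GPU energy model (invented coefficients, not the
// paper's calibration): E = (P_static + P_dynamic) * t_predicted.
struct GpuPowerModel {
    double p_static_w;   // idle/static power draw (W)
    double p_dynamic_w;  // average dynamic power while a kernel runs (W)
};

double predict_kernel_energy_j(const GpuPowerModel &m, double predicted_s) {
    return (m.p_static_w + m.p_dynamic_w) * predicted_s;
}

int main() {
    GpuPowerModel m2075{70.0, 110.0};  // hypothetical Tesla M2075-like numbers
    double t = 0.012;                  // kernel time predicted by simulation (s)
    std::printf("predicted energy: %.3f J\n", predict_kernel_energy_j(m2075, t));
}
```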
{"title":"Predicting the Energy Consumption of CUDA Kernels using SimGrid","authors":"Dorra Boughzala, L. Lefèvre, Anne-Cécile Orgerie","doi":"10.1109/SBAC-PAD49847.2020.00035","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00035","url":null,"abstract":"Building a sustainable Exascale machine is a very promising target in High Performance Computing (HPC). To tackle the energy consumption challenge while continuing to provide tremendous performance, the HPC community have rapidly adopted GPU-based systems. Today, GPUs have became the most prevailing components in the massively parallel HPC landscape thanks to their high computational power and energy efficiency. Modeling the energy consumption of applications running on GPUs has gained a lot of attention for the last years. Alas, the HPC community lacks simple yet accurate simulators to predict the energy consumption of general purpose GPU applications. In this work, we address the prediction of the energy consumption of CUDA kernels via simulation. We propose in this paper a simple and lightweight energy model that we implemented using the open-source framework SimGrid. Our proposed model is validated across a diverse set of CUDA kernels and on two different NVIDIA GPUs (Tesla M2075 and Kepler K20Xm). As our modeling approach is not based on performance counters or detailed-architecture parameters, we believe that our model can be easily approved by users who take care of the energy consumption of their GPGPU applications.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121827571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Exploiting Non-conventional DVFS on GPUs: Application to Deep Learning
Francisco Mendes, P. Tomás, N. Roma
The use of Graphics Processing Units (GPUs) to accelerate Deep Neural Network (DNN) training and inference is already widely adopted, allowing for a significant increase in the performance of these applications. However, this increase in performance comes at the cost of a corresponding increase in energy consumption. While several solutions have been proposed to perform Voltage-Frequency (V-F) scaling on GPUs, these remain one-dimensional: they simply adjust the frequency while relying on default voltage settings. To overcome this, this paper introduces a methodology to fully characterize the impact of non-conventional Dynamic Voltage and Frequency Scaling (DVFS) on GPUs. The proposed approach was applied to an AMD Vega 10 Frontier Edition GPU. When applying this non-conventional DVFS scheme to DNNs, the obtained results show that it is possible to safely decrease the GPU voltage, allowing for a significant reduction of the energy consumption (up to 38%) and of the Energy-Delay Product (EDP) (up to 41%) when training CNN models, with no degradation of the networks' accuracy.
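To make the metrics concrete, the toy sketch below scans a few voltage-frequency pairs under the classical P_dyn ~ C * V^2 * f model and reports energy and the Energy-Delay Product (EDP) for each. All numbers are invented and do not reflect the paper's Vega 10 measurements; real undervolting must also respect stability margins.

```cpp
#include <cstdio>
#include <vector>

// Toy exploration of non-conventional V-F pairs using the classical
// P_dyn ~ C * V^2 * f model. Numbers are invented for illustration.
struct VFPoint { double v; double f_ghz; };

int main() {
    const double C = 2.0;       // effective switched capacitance (arbitrary units)
    const double work = 1.0e9;  // cycles needed by the training step
    std::vector<VFPoint> table = {
        {1.10, 1.50},           // default voltage at top frequency
        {1.00, 1.50},           // undervolted at the same frequency
        {0.95, 1.40},           // further undervolting with a slight downclock
    };
    for (const auto &p : table) {
        double t = work / (p.f_ghz * 1e9);       // execution time (s)
        double power = C * p.v * p.v * p.f_ghz;  // dynamic power (arb. W)
        double energy = power * t;
        double edp = energy * t;                 // Energy-Delay Product
        std::printf("V=%.2f f=%.2fGHz  E=%.3f  EDP=%.4f\n",
                    p.v, p.f_ghz, energy, edp);
    }
}
```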
{"title":"Exploiting Non-conventional DVFS on GPUs: Application to Deep Learning","authors":"Francisco Mendes, P. Tomás, N. Roma","doi":"10.1109/SBAC-PAD49847.2020.00012","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00012","url":null,"abstract":"The use of Graphics Processing Units (GPUs) to accelerate Deep Neural Networks (DNNs) training and inference is already widely adopted, allowing for a significant increase in the performance of these applications. However, this increase in performance comes at the cost of a consequent increase in energy consumption. While several solutions have been proposed to perform Voltage-Frequency (V-F) scaling on GPUs, these are still one-dimensional, by simply adjusting frequency while relying on default voltage settings. To overcome this, this paper introduces a methodology to fully characterize the impact of non-conventional Dynamic Voltage and Frequency Scaling (DVFS) in GPUs. The proposed approach was applied to an AMD Vega 10 Frontier Edition GPU. When applying this non-conventional DVFS scheme to DNNs, the obtained results show that it is possible to safely decrease the GPU voltage, allowing for a significant reduction of the energy consumption (up to 38%) and the Energy-Delay Product (EDP) (up to 41%) on the training of CNN models, with no degradation of the networks accuracy.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128546352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
On the Memory Underutilization: Exploring Disaggregated Memory on HPC Systems
I. Peng, R. Pearce, M. Gokhale
Large-scale high-performance computing (HPC) systems consist of massive compute and memory resources tightly coupled within nodes. We perform a large-scale study of memory utilization on four production HPC clusters. Our results show that more than 90% of jobs utilize less than 15% of the node memory capacity, and that for 90% of the time memory utilization is below 35%. Recently, disaggregated architectures have been gaining traction because they can selectively scale up a resource and improve resource utilization. Based on these observations, we explore using disaggregated memory to support memory-intensive applications, while most jobs remain intact on HPC systems with reduced node memory. We designed and developed a user-space remote-memory paging library that enables applications to explore disaggregated memory on existing HPC clusters. We quantified the impact of access patterns and network connectivity in benchmarks. Our case studies of graph-processing and Monte Carlo applications evaluated the impact of application characteristics and local memory capacity and highlighted the potential of throughput scaling on disaggregated memory.
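The abstract mentions a user-space remote-memory paging library without detailing its mechanism. One well-known user-space technique is to protect a region with mprotect and service faults in a SIGSEGV handler, fetching the page from a slower tier on first touch; the sketch below simulates the "remote" fetch with a local buffer. It is illustrative only: production libraries typically use userfaultfd or RDMA, and calling memcpy inside a signal handler is not strictly async-signal-safe.

```cpp
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

// Illustrative user-space demand paging via mprotect + SIGSEGV.
// "Remote" memory is simulated by a local backing buffer.
static const size_t kPage = 4096;
static char g_backing[kPage];  // stand-in for a page held in a remote tier

static void fault_handler(int, siginfo_t *si, void *) {
    char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(kPage - 1));
    mprotect(page, kPage, PROT_READ | PROT_WRITE);  // make the page accessible
    memcpy(page, g_backing, kPage);                 // "fetch" from remote tier
}

int main() {
    memset(g_backing, 'R', sizeof(g_backing));
    char *region = (char *)mmap(nullptr, kPage, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa = {};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, nullptr);

    printf("first byte: %c\n", region[0]);  // faults once, then prints 'R'
    munmap(region, kPage);
}
```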
{"title":"On the Memory Underutilization: Exploring Disaggregated Memory on HPC Systems","authors":"I. Peng, R. Pearce, M. Gokhale","doi":"10.1109/SBAC-PAD49847.2020.00034","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00034","url":null,"abstract":"Large-scale high-performance computing (HPC) systems consist of massive compute and memory resources tightly coupled in nodes. We perform a large-scale study of memory utilization on four production HPC clusters. Our results show that more than 90% of jobs utilize less than 15% of the node memory capacity, and for 90% of the time, memory utilization is less than 35%. Recently, disaggregated architecture is gaining traction because it can selectively scale up a resource and improve resource utilization. Based on these observations, we explore using disaggregated memory to support memory-intensive applications, while most jobs remain intact on HPC systems with reduced node memory. We designed and developed a user-space remote-memory paging library to enable applications exploring disaggregated memory on existing HPC clusters. We quantified the impact of access patterns and network connectivity in benchmarks. Our case studies of graph-processing and Monte-Carlo applications evaluated the impact of application characteristics and local memory capacity and highlighted the potential of throughput scaling on disaggregated memory.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"48 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127652518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
Optically Connected Memory for Disaggregated Data Centers
Jorge González, A. Gazman, Maarten Hattink, Mauricio G. Palma, M. Bahadori, Ruth E. Rubio-Noriega, Lois Orosa, M. Glick, O. Mutlu, K. Bergman, R. Azevedo
Recent advances in integrated photonics enable the implementation of reconfigurable, high-bandwidth, and low energy-per-bit interconnects in next-generation data centers. We propose and evaluate an Optically Connected Memory (OCM) architecture that disaggregates the main memory from the computation nodes in data centers. OCM is based on micro-ring resonators (MRRs) and requires no modification to the DRAM memory modules. We calculate energy consumption from real photonic devices and integrate them into a system simulator to evaluate performance. Our results show that (1) OCM is capable of interconnecting four DDR4 memory channels to a computing node using two fibers at an energy consumption of 1.07 pJ per bit, and (2) OCM performs up to 5.5x faster than a disaggregated memory that connects to computing nodes with 40G PCIe NIC connectors.
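As a back-of-envelope check on what 1.07 pJ per bit implies for the four interconnected channels, the snippet below converts that figure into link power at full bandwidth. The per-channel bandwidth is an assumption (DDR4-2400, about 19.2 GB/s); the abstract does not state the modeled speed grade.

```cpp
#include <cstdio>

// Back-of-envelope check of the 1.07 pJ/bit figure. The per-channel
// bandwidth is an assumption (DDR4-2400); the paper's speed grade is not given.
int main() {
    const double pj_per_bit = 1.07;
    const double gb_per_s_per_channel = 19.2;  // assumed DDR4-2400 peak
    const int channels = 4;
    double bits_per_s = gb_per_s_per_channel * 1e9 * 8 * channels;
    double watts = bits_per_s * pj_per_bit * 1e-12;
    std::printf("link power at full bandwidth: %.2f W\n", watts);  // ~0.66 W
}
```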
{"title":"Optically Connected Memory for Disaggregated Data Centers","authors":"Jorge González, A. Gazman, Maarten Hattink, Mauricio G. Palma, M. Bahadori, Ruth E. Rubio-Noriega, Lois Orosa, M. Glick, O. Mutlu, K. Bergman, R. Azevedo","doi":"10.1109/SBAC-PAD49847.2020.00017","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00017","url":null,"abstract":"Recent advances in integrated photonics enable the implementation of reconfigurable, high-bandwidth, and low energy-per-bit interconnects in next-generation data centers. We propose and evaluate an Optically Connected Memory (OCM) architecture that disaggregates the main memory from the computation nodes in data centers. OCM is based on micro-ring resonators (MRRs), and it does not require any modification to the DRAM memory modules. We calculate energy consumption from real photonic devices and integrate them into a system simulator to evaluate performance. Our results show that (1) OCM is capable of interconnecting four DDR4 memory channels to a computing node using two fibers with 1.07 pJ energy-per-bit consumption and (2) OCM performs up to 5.5x faster than a disaggregated memory with 40G PCIe NIC connectors to computing nodes.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129168661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Scheduling Methods to Reduce Response Latency of Function as a Service
P. Żuk, K. Rządca
Function as a Service (FaaS) permits cloud customers to deploy individual functions to the cloud, in contrast to complete virtual machines or Linux containers. All major cloud providers offer FaaS products (Amazon Lambda, Google Cloud Functions, Azure Serverless); there are also popular open-source implementations (Apache OpenWhisk) with commercial offerings (Adobe I/O Runtime, IBM Cloud Functions). A new feature of FaaS is function composition: a function may (sequentially) call another function, which, in turn, may call yet another function, forming a chain of invocations. From the perspective of the infrastructure, a composed FaaS application is less opaque than a virtual machine or a container. We show that this additional information enables the infrastructure to reduce the response latency. In particular, knowing the sequence of future invocations, the infrastructure can schedule these invocations along with environment preparation. We model resource management in FaaS as a scheduling problem combining (1) sequencing of invocations, (2) deploying execution environments on machines, and (3) allocating invocations to deployed environments. For each aspect, we propose heuristics and explore their performance by simulation on a range of synthetic workloads. Our results show that if setup times are long compared to invocation times, algorithms that use information about the composition of functions consistently outperform greedy, myopic algorithms, leading to a significant decrease in response latency.
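The benefit of knowing the invocation chain ahead of time can be shown with a toy model (invented timings, not the paper's simulator): a myopic scheduler starts preparing each execution environment only when the call arrives, while a composition-aware scheduler starts all environment setups as soon as the chain is known.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy model of an invocation chain f1 -> f2 -> f3 (invented timings).
struct Fn { double setup_s; double run_s; };

int main() {
    std::vector<Fn> chain = {{2.0, 0.3}, {2.0, 0.3}, {2.0, 0.3}};

    // Myopic: each environment setup starts only when its call arrives.
    double myopic = 0.0;
    for (const auto &f : chain) myopic += f.setup_s + f.run_s;

    // Composition-aware: the chain is known upfront, so all setups start in
    // parallel at t = 0; a stage waits only if its environment is not ready.
    double aware = 0.0;
    for (const auto &f : chain) {
        aware = std::max(aware, f.setup_s);  // wait for the environment
        aware += f.run_s;                    // then run the function
    }
    // With these numbers: 6.9 s (myopic) vs 2.9 s (composition-aware).
    std::printf("myopic: %.1f s, composition-aware: %.1f s\n", myopic, aware);
}
```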
{"title":"Scheduling Methods to Reduce Response Latency of Function as a Service","authors":"P. Żuk, K. Rządca","doi":"10.1109/SBAC-PAD49847.2020.00028","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00028","url":null,"abstract":"Function as a Service (FaaS) permits cloud customers to deploy to cloud individual functions, in contrast to complete virtual machines or Linux containers. All major cloud providers offer FaaS products (Amazon Lambda, Google Cloud Functions, Azure Serverless); there are also popular open-source implementations (Apache OpenWhisk) with commercial offerings (Adobe I/O Runtime, IBM Cloud Functions). A new feature of FaaS is function composition: a function may (sequentially) call another function, which, in turn, may call yet another function - forming a chain of invocations. From the perspective of the infrastructure, a composed FaaS is less opaque than a virtual machine or a container. We show that this additional information enables the infrastructure to reduce the response latency. In particular, knowing the sequence of future invocations, the infrastructure can schedule these invocations along with environment preparation. We model resource management in FaaS as a scheduling problem combining (1) sequencing of invocations, (2) deploying execution environments on machines, and (3) allocating invocations to deployed environments. For each aspect, we propose heuristics. We explore their performance by simulation on a range of synthetic workloads. Our results show that if the setup times are long compared to invocation times, algorithms that use information about the composition of functions consistently outperform greedy, myopic algorithms, leading to significant decrease in response latency.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124712831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
sputniPIC: An Implicit Particle-in-Cell Code for Multi-GPU Systems
Steven W. D. Chien, Jonas Nylund, Gabriel Bengtsson, I. Peng, Artur Podobas, S. Markidis
Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation to exploit such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. In contrast to the domain decomposition used in CPU-based implementations, we introduce a particle-decomposition data layout that uses particle batches to overlap communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speedups on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide a performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement with respect to the sputniPIC CPU OpenMP version. We show that reduced precision can further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, on a single node with multiple GPUs, sputniPIC enables large-scale three-dimensional PIC simulations that were previously possible only on clusters.
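The overlap of communication and computation enabled by particle batches can be sketched generically: while batch i is being pushed, batch i-1's exchange proceeds asynchronously. The sketch below uses std::async as a stand-in for GPU streams and MPI, and all names are invented rather than taken from sputniPIC.

```cpp
#include <cstdio>
#include <future>
#include <vector>

// Sketch of overlapping communication with computation via particle
// batches; std::async stands in for GPU streams/MPI, names are invented.
struct Batch { std::vector<float> x, v; };

void push_particles(Batch &b, float dt) {      // compute phase (particle mover)
    for (size_t i = 0; i < b.x.size(); ++i) b.x[i] += b.v[i] * dt;
}
void exchange_moments(const Batch &) { /* communication stand-in */ }

int main() {
    std::vector<Batch> batches(4, Batch{std::vector<float>(1 << 20, 0.f),
                                        std::vector<float>(1 << 20, 1.f)});
    std::future<void> pending;                 // in-flight communication
    for (auto &b : batches) {
        push_particles(b, 0.1f);               // compute batch i ...
        if (pending.valid()) pending.wait();   // ... overlapped with i-1's comm
        pending = std::async(std::launch::async, exchange_moments, std::cref(b));
    }
    if (pending.valid()) pending.wait();
    std::printf("all batches done\n");
}
```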
{"title":"sputniPIC: An Implicit Particle-in-Cell Code for Multi-GPU Systems","authors":"Steven W. D. Chien, Jonas Nylund, Gabriel Bengtsson, I. Peng, Artur Podobas, S. Markidis","doi":"10.1109/SBAC-PAD49847.2020.00030","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00030","url":null,"abstract":"Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation for exploiting such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle decomposition data layout, in contrast to domain decomposition on CPU-based implementations, to use particle batches for overlapping communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speed up on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement with respect to the sputniPIC CPU OpenMP version performance. We show that reduced precision could further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, on a single node with multiple GPUs, sputniPIC enables large-scale three-dimensional PIC simulations that were only possible using clusters.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127918225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
TASO: Time and Space Optimization for Memory-Constrained DNN Inference
Yuan Wen, Andrew Anderson, Valentin Radu, M. O’Boyle, David Gregg
Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large networks, which are prohibitively expensive to run on mobile and embedded devices with tightly constrained memory and energy budgets. We propose an approach for ahead-of-time, domain-specific optimization of CNN models based on integer linear programming (ILP) for selecting the primitive operations that implement convolutional layers. We optimize the trade-off between execution time and memory consumption by: 1) minimizing execution time across the whole network by selecting data layouts and primitive operations to implement each layer; and 2) allocating an appropriate workspace that reflects the upper bound of the memory footprint per layer. These two optimization strategies can be used to run any CNN on any platform with a C compiler. Our evaluation with a range of popular ImageNet neural architectures (GoogleNet, AlexNet, VGG, ResNet and SqueezeNet) on the ARM Cortex-A15 yields speedups of 8× compared to greedy primitive selection and reduces the memory requirement by 2.2× while sacrificing only 15% of inference time compared to a solver that considers inference time only. In addition, our optimization approach exposes a range of optimal points for different configurations along the Pareto frontier of the memory/latency trade-off, which can be used under arbitrary system constraints.
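As a simplified stand-in for the paper's ILP (invented numbers, and deliberately ignoring the data-layout compatibility constraints between adjacent layers that make the real problem an ILP), the sketch below picks, for each layer, the fastest primitive whose workspace fits a shared memory budget; sweeping the budget exposes points along the time/memory Pareto frontier.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Per layer, choose the fastest primitive whose workspace fits a shared
// budget (assumes every layer has at least one fitting candidate).
struct Primitive { const char *name; double time_ms; double ws_mb; };

int main() {
    std::vector<std::vector<Primitive>> layers = {
        {{"direct", 5.0, 1.0}, {"im2col+GEMM", 2.0, 24.0}, {"winograd", 1.5, 40.0}},
        {{"direct", 8.0, 1.0}, {"im2col+GEMM", 3.5, 32.0}},
    };
    const double budget_mb = 30.0;  // one point on the time/memory frontier
    double total_ms = 0.0, peak_mb = 0.0;
    for (const auto &cands : layers) {
        const Primitive *best = nullptr;
        for (const auto &p : cands)
            if (p.ws_mb <= budget_mb && (!best || p.time_ms < best->time_ms))
                best = &p;
        total_ms += best->time_ms;
        peak_mb = std::max(peak_mb, best->ws_mb);  // workspace reused across layers
        std::printf("layer uses %s\n", best->name);
    }
    std::printf("time %.1f ms, workspace %.1f MB\n", total_ms, peak_mb);
}
```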
{"title":"TASO: Time and Space Optimization for Memory-Constrained DNN Inference","authors":"Yuan Wen, Andrew Anderson, Valentin Radu, M. O’Boyle, David Gregg","doi":"10.1109/SBAC-PAD49847.2020.00036","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00036","url":null,"abstract":"Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large networks, which are prohibitively expensive to run on mobile and embedded devices with tightly constrained memory and energy budgets. We propose an approach for ahead-of-time domain specific optimization of CNN models, based on an integer linear programming (ILP) for selecting primitive operations to implement convolutional layers. We optimize the trade-off between execution time and memory consumption by: 1) attempting to minimize execution time across the whole network by selecting data layouts and primitive operations to implement each layer; and 2) allocating an appropriate work space that reflects the upper bound of memory footprint per layer. These two optimization strategies can be used to run any CNN on any platform with a C compiler. Our evaluation with a range of popular ImageNet neural architectures (GoogleNet, AlexNet, VGG, ResNetand SqueezeNet) on the ARM Cortex-A15 yields speedups of 8× compared to a greedy algorithm based primitive selection, reduces memory requirement by 2.2× while sacrificing only 15% of inference time compared to a solver that considers inference time only. In addition, our optimization approach exposes a range of optimal points for different configurations across the Pareto frontier of memory and latency trade-off, which can be used under arbitrary system constraints.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132770003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing
V. E. Venugopal, M. Theobald, Samira Chaychi, Amal Tawakuli
Distributed Stream Processing Engines (DSPEs) are currently among the hottest topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big data analytics. In this paper, we describe the architecture of our AIR engine, which is designed from scratch in C++ using the Message Passing Interface (MPI) and pthreads for multithreading, and is deployed directly on top of a common HPC workload manager such as SLURM. AIR implements a light-weight, dynamic sharding protocol (referred to as "Asynchronous Iterative Routing") that facilitates direct and asynchronous communication among all worker nodes and thereby completely avoids the additional communication overhead of a dedicated master node. With its unique design, AIR fills the gap between prevalent scale-out (but Java-based) architectures like Apache Spark and Flink on the one hand, and recent scale-up (and C++-based) prototypes such as StreamBox and PiCo on the other. Our experiments over various benchmark settings confirm that AIR performs as well as the best scale-up SPEs on a single-node setup, while it outperforms existing scale-out DSPEs in terms of processing latency and sustainable throughput by a factor of up to 15 in a distributed setting.
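The master-free, asynchronous worker-to-worker communication style the abstract describes can be sketched with plain MPI non-blocking primitives. This is a generic pattern, not AIR's actual protocol: each worker routes a tuple directly to a peer and concurrently polls for incoming tuples.

```cpp
#include <mpi.h>
#include <cstdio>

// Generic master-free, asynchronous worker-to-worker routing with MPI.
// Illustrative pattern only, not AIR's actual protocol.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Route a tuple directly to a peer (round-robin stands in for the
    // sharding hash); no coordination through a master node.
    int tuple = rank * 100, dest = (rank + 1) % size;
    MPI_Request req;
    MPI_Isend(&tuple, 1, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);

    // Poll for incoming tuples while our own send proceeds asynchronously.
    int flag = 0;
    MPI_Status st;
    while (!flag) MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &st);
    int incoming;
    MPI_Recv(&incoming, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    std::printf("rank %d got %d from rank %d\n", rank, incoming, st.MPI_SOURCE);
    MPI_Finalize();
}
```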
{"title":"AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing","authors":"V. E. Venugopal, M. Theobald, Samira Chaychi, Amal Tawakuli","doi":"10.1109/SBAC-PAD49847.2020.00018","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00018","url":null,"abstract":"Distributed Stream Processing Engines (DSPEs) are currently among the most emerging topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big data analytics. In this paper, we describe the architecture of our AIR engine, which is designed from scratch in C++ using the Message Passing Interface (MPI), pthreads for multithreading, and is directly deployed on top of a common HPC workload manager such as SLURM. AIR implements a light-weight, dynamic sharding protocol (referred to as \"Asynchronous Iterative Routing\"), which facilitates a direct and asynchronous communication among all worker nodes and thereby completely avoids any additional communication overhead with a dedicated master node. With its unique design, AIR fills the gap between the prevalent scale-out (but Java-based) architectures like Apache Spark and Flink, on one hand, and recent scale-up (and C++ based) prototypes such as StreamBox and PiCo, on the other hand. Our experiments over various benchmark settings confirm that AIR performs as good as the best scale-up SPEs on a single-node setup, while it outperforms existing scale-out DSPEs in terms of processing latency and sustainable throughput by a factor of up to 15 in a distributed setting.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114462866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7