2020 IEEE International Symposium on Workload Characterization (IISWC): Latest Publications

Characterizing the Scale-Up Performance of Microservices using TeaStore
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00014
Sriyash Caculo, K. Lahiri, Subramaniam Kalambur
Cloud-based applications architected using microservices are becoming increasingly common. While recent work has studied how to optimize the performance of these applications at the data-center level, comparatively little is known about how these services utilize end-server compute resources. Major advances have been made in recent years in terms of the compute density offered by cloud servers, thanks to the emergence of mainstream, high-core count CPU designs. Consequently, it has become equally important to understand the ability of microservices to “scale up” within a server and make effective use of available resources. This paper presents a study of a publicly available microservice based application on a state-of-the-art x86 server supporting 128 logical CPUs per socket. We highlight the significant performance opportunities that exist when the scaling properties of individual services and knowledge of the underlying processor topology are properly exploited. Using such techniques, we demonstrate a throughput uplift of 22% and a latency reduction of 18% over a performance-tuned baseline of our microservices workload. In addition, we describe how such microservice-based applications are distinct from workloads commonly used for designing general-purpose server processors.
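As a rough illustration of the kind of topology-aware placement the abstract alludes to, the sketch below pins a service process to the logical CPUs of one NUMA node so its memory traffic stays local. The CPU ranges, service name, and command line are hypothetical and not taken from the paper.

```python
import os
import subprocess

# Hypothetical topology: logical CPUs 0-63 on NUMA node 0, 64-127 on node 1.
NODE0_CPUS = set(range(0, 64))

def launch_pinned(cmd, cpus):
    """Start a service and restrict it to the given set of logical CPUs (Linux)."""
    proc = subprocess.Popen(cmd)
    os.sched_setaffinity(proc.pid, cpus)  # keep the process on one NUMA node
    return proc

# e.g. keep a CPU-heavy microservice tier on one node; command is illustrative.
# launch_pinned(["java", "-jar", "service.jar"], NODE0_CPUS)
```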
Citations: 7
Scalable and Fast Lazy Persistency on GPUs
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00032
Ardhi Wiratama Baskara Yudha, K. Kimura, Huiyang Zhou, Yan Solihin
GPU applications, including many scientific and machine learning applications, increasingly demand larger memory capacity. NVM promises higher density than DRAM and better future scaling potential. Long-running GPU applications can benefit from NVM by exploiting its persistency, allowing crash recovery of data in memory. In this paper, we propose mapping Lazy Persistency (LP) to GPUs and identify the design space of such a mapping. We then characterize LP performance on GPUs, varying the checksum type, reduction method, use of locking, and hash table designs. Armed with insights into the performance bottlenecks, we propose a hash-table-less method that performs well on hundreds and thousands of threads, achieving persistency with a nearly negligible (2.1%) slowdown for a variety of representative benchmarks. We also propose directive-based programming language support to simplify the effort of adding LP to GPU applications.
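The paper targets CUDA threads writing to NVM; the following is only a conceptual CPU-side sketch of the Lazy Persistency idea, under the assumption that each work region is persisted together with a checksum and that recovery recomputes any region whose checksum fails to verify, instead of issuing eager flushes and fences.

```python
import pickle
import zlib

def persist_region(store, region_id, data):
    """Write a region lazily: data plus checksum, with no explicit flush/fence."""
    blob = pickle.dumps(data)
    store[region_id] = (blob, zlib.crc32(blob))

def recover(store, recompute):
    """After a crash, keep regions whose checksum verifies; redo the rest."""
    results = {}
    for region_id, (blob, checksum) in store.items():
        if zlib.crc32(blob) == checksum:
            results[region_id] = pickle.loads(blob)    # region fully persisted
        else:
            results[region_id] = recompute(region_id)  # partial write: redo work
    return results
```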
Citations: 7
Selective Event Processing for Energy Efficient Mobile Gaming with SNIP
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00035
Prasanna Venkatesh Rengasamy, Haibo Zhang, Shulin Zhao, A. Sivasubramaniam, M. Kandemir, C. Das
Gaming is an important class of workloads for mobile devices. Games are not only one of the biggest markets for game developers and app stores, but also amongst the most stressful applications for the SoC. In these workloads, much of the computation is user-driven, i.e., events captured from sensors drive the computation to be performed. Consequently, event processing constitutes the bulk of the energy drain for these applications. To address this problem, we conduct a detailed characterization of event processing activities in several popular games and show that (i) some of the events are exactly repetitive in their inputs, not requiring any processing at all; or (ii) a significant number of events are redundant in that even if the inputs for these events are different, the output matches events already processed. Memoization is one of the obvious choices to optimize such behavior; however, the problem is a lot more challenging in this context because the computation can span even functional/OS boundaries, and the input space required for tables can take gigabytes of storage. Instead, our Selecting Necessary InPuts (SNIP) software solution uses machine learning to isolate the input features that we really need to track in order to considerably shrink memoization tables. We show that SNIP can save up to 32% of the energy in these games without requiring any hardware modifications.
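The core idea, memoizing event processing on a reduced set of input features, can be sketched as below. The event fields, bucketing, and handler are hypothetical stand-ins; SNIP itself learns which features to key on rather than hard-coding them.

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def process_key(gesture, x_bucket, y_bucket):
    """Stand-in for the expensive per-event game-state computation."""
    return f"{gesture}:{x_bucket}:{y_bucket}"  # placeholder output

def handle_event(event):
    # Key only on the features that (hypothetically) determine the output,
    # so the memoization table stays far smaller than the raw input space.
    key = (event["gesture"], event["x"] // 16, event["y"] // 16)
    return process_key(*key)

print(handle_event({"gesture": "tap", "x": 130, "y": 42}))
print(handle_event({"gesture": "tap", "x": 135, "y": 47}))  # same buckets: cache hit
```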
Citations: 0
A Rigorous Benchmarking and Performance Analysis Methodology for Python Workloads
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00017
Arthur Crapé, L. Eeckhout
Computer architecture and computer systems research and development are heavily driven by benchmarking and performance analysis. It is thus of paramount importance that rigorous methodologies are used to draw correct conclusions and steer research and development in the right direction. While rigorous methodologies are widely used for native and managed programming language workloads, scripting language workloads are subject to ad-hoc methodologies which lead to incorrect and misleading conclusions. In particular, we find incorrect public statements regarding different virtual machines for Python, the most popular scripting language. The incorrect conclusion is a result of using the geometric mean speedup and not making a distinction between start-up and steady-state performance. In this paper, we propose a statistically rigorous benchmarking and performance analysis methodology for Python workloads, which makes a distinction between start-up and steady-state performance and which summarizes average performance across a set of benchmarks using the harmonic mean speedup. We find that a rigorous methodology makes a difference in practice. In particular, we find that the PyPy JIT compiler outperforms the CPython interpreter by 1.76× for steady-state while being 2% slower for start-up, which refutes the statement on the PyPy website that ‘PyPy outperforms CPython by 4.4× on average’, a claim based on the geometric mean speedup that makes no distinction between start-up and steady-state. We use the proposed methodology to analyze Python workloads, which yields several interesting findings regarding PyPy versus CPython performance, start-up versus steady-state performance, the impact of a workload's input size, and Python workload execution characteristics at the microarchitecture level.
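A toy example of why the choice of mean matters: the per-benchmark speedups below are made up and only illustrate how the geometric mean can flip the conclusion relative to the harmonic mean, which weights each benchmark's work equally.

```python
from math import prod

def geometric_mean(xs):
    return prod(xs) ** (1.0 / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

# Hypothetical per-benchmark speedups of one Python VM over another.
speedups = [4.0, 4.0, 0.25]
print(f"geometric mean: {geometric_mean(speedups):.2f}")  # ~1.59 -> looks faster
print(f"harmonic mean:  {harmonic_mean(speedups):.2f}")   # ~0.67 -> slower overall
```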
Citations: 2
Demystifying Power and Performance Bottlenecks in Autonomous Driving Systems
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00028
P. H. E. Becker, J. Arnau, Antonio González
Autonomous Vehicles (AVs) have the potential to radically change the automotive industry. However, computing solutions for AVs have to meet severe performance and power constraints to guarantee a safe driving experience. Current solutions either exhibit high cost and power dissipation or fail to meet the stringent latency constraints. Therefore, the popularization of AVs requires a low-cost yet effective computing system. Understanding the sources of latency and energy consumption is key to improving autonomous driving systems. In this paper, we present a detailed characterization of Autoware, a modern self-driving car system. We analyze the performance and power of the different components and leverage hardware counters to identify the main bottlenecks. Our approach to AV characterization avoids pitfalls of previous works: profiling individual components in isolation and neglecting LiDAR-related components. We base our characterization on a rigorous methodology that considers the entire software stack. Profiling the end-to-end system accounts for interference and contention among different components that run in parallel, and also includes the memory transfers used to communicate data. We show that all these factors have a high impact on latency and cannot be measured by profiling isolated modules. Our characterization provides novel insights; some of the interesting findings are the following. First, contention among different modules drastically impacts latency and performance predictability. Second, LiDAR-related components are important contributors to the latency of the system. Finally, a modern platform with a high-end CPU and GPU cannot achieve real-time performance when considering the entire end-to-end system.
Citations: 11
Port or Shim? Stress Testing Application Performance on Intel SGX
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00021
Aisha Hasan, Ryan D. Riley, D. Ponomarev
Intel's newer processors come equipped with Software Guard Extensions (SGX) technology, allowing developers to write sections of code that run in a protected area of memory known as an enclave. In this work, we compare the performance of two scenarios for running existing code on SGX. In one, a developer manually ports the code to SGX. In the other, a shim layer and library OS are used to run the code unmodified on SGX. Our initial results demonstrate that when running an existing benchmarking tool under SGX, in addition to being much faster to develop, code running in the library OS also tends to run at the same speed or faster than code that is manually ported. After obtaining this result, we then go on to design a series of microbenchmarks to characterize exactly what types of workloads would benefit from manual porting. We find that if the application to be ported has a small sensitive working set (less than the 6MB available cache size of the CPU), infrequently needs to enter the enclave (less than 110,000 times per second), and spends most of its time working on data outside of the enclave, then it may indeed perform better if it is manually ported rather than run in a shim.
Citations: 4
Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00033
Sangpyo Kim, Wonkyung Jung, J. Park, Jung Ho Ahn
Homomorphic encryption (HE) draws huge attention as it provides a way to perform privacy-preserving computations on encrypted messages. The Number Theoretic Transform (NTT), a specialized form of the Discrete Fourier Transform (DFT) in the finite field of integers, is the key algorithm that enables fast computation on encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse transformation on a popular parallel processing platform, the GPU, by leveraging DFT optimization techniques. However, these GPU-based studies lack a comprehensive analysis of the primary differences between NTT and DFT or only consider small HE parameters that have tight constraints on the number of arithmetic operations that can be performed without decryption. In this paper, we analyze the algorithmic characteristics of NTT and DFT and assess the performance of NTT when we apply the optimizations that are commonly applicable to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT suffers from a severe main-memory bandwidth bottleneck on large HE parameter sets. To tackle the main-memory bandwidth issue, we propose a novel NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling (OT). Compared to the baseline radix-2 NTT implementation, after applying all the optimizations, including OT, we achieve a 4.2× speedup on a modern GPU.
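For reference, a minimal iterative radix-2 NTT over a prime field is shown below, in plain Python rather than the paper's GPU kernels or HE parameter sizes; the prime 998244353 with generator 3 is a standard NTT-friendly choice, not one taken from the paper.

```python
def ntt(a, p, root):
    """In-place iterative radix-2 NTT of a over Z_p.
    len(a) must be a power of two and root a primitive len(a)-th
    root of unity modulo p."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages, doubling the transform length each time.
    length = 2
    while length <= n:
        w_len = pow(root, n // length, p)  # twiddle step for this stage
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % p
                a[k] = (u + v) % p
                a[k + length // 2] = (u - v) % p
                w = w * w_len % p
        length <<= 1
    return a

p, n = 998244353, 8                 # NTT-friendly prime: 119 * 2^23 + 1
root = pow(3, (p - 1) // n, p)      # primitive n-th root of unity mod p
print(ntt([1, 2, 3, 4, 0, 0, 0, 0], p, root))
```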
Citations: 40
A Sparse Tensor Benchmark Suite for CPUs and GPUs
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00027
Jiajia Li, M. Lakshminarasimhan, Xiaolong Wu, Ang Li, C. Olschanowsky, K. Barker
Tensor computations present significant performance challenges that impact a wide spectrum of applications ranging from machine learning, healthcare analytics, social network analysis, and data mining to quantum chemistry and signal processing. Efforts to improve the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO) on CPUs and GPUs. It presents a set of reference tensor kernel implementations that are compatible with real-world tensors and with power-law tensors extended from synthetic graph generation techniques. We also propose Roofline performance models for these kernels to provide insights into computer platforms from a sparse-tensor view. This benchmark suite, along with the synthetic tensor generator, is publicly available.
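To make the COO format concrete, here is a small sketch of a third-order sparse tensor stored as index tuples plus values, with a tensor-times-vector product along the last mode; the tensor and kernel are illustrative only and are not part of the benchmark suite.

```python
import numpy as np

# COO storage: one (i, j, k) tuple per nonzero, plus a parallel value array.
indices = np.array([[0, 0, 1],
                    [0, 2, 0],
                    [1, 1, 1],
                    [2, 0, 2]])
values = np.array([1.0, 2.0, 3.0, 4.0])
dims = (3, 3, 3)

def ttv_mode2(indices, values, dims, v):
    """Tensor-times-vector along mode 2: Y[i, j] = sum_k X[i, j, k] * v[k]."""
    out = np.zeros((dims[0], dims[1]))
    for (i, j, k), x in zip(indices, values):
        out[i, j] += x * v[k]
    return out

print(ttv_mode2(indices, values, dims, np.array([1.0, 2.0, 3.0])))
```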
Citations: 2
Pocolo: Power Optimized Colocation in Power Constrained Environments
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00010
Iyswarya Narayanan, Adithya Kumar, A. Sivasubramaniam
There is a considerable amount of prior work on co-locating applications on datacenter servers to boost resource utilization. However, we note that it is equally important to take power into consideration from the co-location viewpoint. Applications can still interfere on power in stringently power-constrained infrastructures, despite no direct resource contention between the coexisting applications. This becomes particularly important with dynamic load variations, where even if the power capacity is tuned for the peak load of an application, co-locating another application with it during its off-period can lead to overshooting the power capacity. Therefore, to extract maximum returns on datacenter infrastructure investments, one needs to jointly handle power and server resources. We explore this problem in the context of a private-cloud cluster which is provisioned for a primary latency-critical application but also admits secondary best-effort applications to improve utilization during off-peak periods. Our solution, Pocolo, draws on principles from economics to reason about resource demands in power-constrained environments and provides answers to the when/where/what questions pertaining to co-location. We implement Pocolo on a Linux cluster to demonstrate its performance and cost benefits over a number of latency-sensitive and best-effort datacenter workloads.
Citations: 1
A Study of APIs for Graph Analytics Workloads
Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00030
Hochan Lee, D. Wong, Loc Hoang, Roshan Dathathri, G. Gill, Vishwesh Jatala, D. Kuck, K. Pingali
Traditionally, parallel graph analytics workloads have been implemented in systems like Pregel, GraphLab, Galois, and Ligra that support graph data structures and graph operations directly. An alternative approach is to express graph workloads in terms of sparse matrix kernels such as sparse matrix-vector and matrix-matrix multiplication. An API for these kernels has been defined by the GraphBLAS project. The SuiteSparse project has implemented this API on shared-memory platforms, and the LAGraph project is building a library of graph algorithms using this API. How does the matrix-based approach perform compared to the graph-based approach? Our experiments on a 56-core CPU show that for representative graph workloads, LAGraph/SuiteSparse solutions are 5x slower on average than Galois solutions. We argue that this performance gap arises from inherent limitations of the matrix-based API: regardless of which architecture a matrix-based algorithm runs on, it is subject to the same limitations imposed by that API.
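The contrast between the two styles can be sketched with breadth-first search written as repeated matrix-vector products over a Boolean-like semiring, which is the style GraphBLAS-flavored libraries encourage; a dense NumPy adjacency matrix is used here only to keep the sketch short, and the example is not taken from LAGraph or SuiteSparse.

```python
import numpy as np

def bfs_levels(adj, source):
    """BFS levels via repeated 'SpMV' steps: frontier -> one-hop neighbors."""
    n = adj.shape[0]
    levels = np.full(n, -1, dtype=int)
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    level = 0
    while frontier.any():
        levels[frontier] = level
        reached = (adj.T @ frontier.astype(int)) > 0   # (OR, AND) semiring product
        frontier = reached & (levels == -1)            # mask out visited vertices
        level += 1
    return levels

adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 0, 1],
                [0, 0, 0, 0]])
print(bfs_levels(adj, 0))   # [0 1 2 2]
```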
Citations: 1