Automation and the use of robotic components within business processes are in vogue across the retail and manufacturing industries. However, a structured way of analyzing the performance improvements provided by automation in complex workflows is still at a nascent stage. In this paper, we consider common Industry 4.0 automation workflow resource patterns and model them within a hybrid queuing network. The queuing stations are replaced by scale-up, scale-out and hybrid-scale automation patterns to examine improvements in end-to-end process performance. We exhaustively simulate throughput, response time, utilization and operating costs at higher concurrencies using Mean Value Analysis (MVA) algorithms. The queues are analyzed for cases with multiple classes, batch/transactional processing and load-dependent service demands. These solutions are demonstrated on an exemplar Industry 4.0 warehouse automation workflow. A structured process for automation workflow performance analysis will prove valuable across industrial deployments.
A. Kattepur. "Towards Structured Performance Analysis of Industry 4.0 Workflow Automation Resources." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309671
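Because the paper above rests on Mean Value Analysis, a minimal sketch of the exact single-class MVA recursion for a closed queueing network may help; the station count, service demands, and think time below are hypothetical placeholders, not values from the paper.

```python
# Exact Mean Value Analysis (MVA) for a closed, single-class queueing network.
# Service demands and think time below are illustrative placeholders only.

def mva(service_demands, think_time, max_users):
    """Return (N, throughput, response time) for N = 1..max_users."""
    k = len(service_demands)
    queue_len = [0.0] * k          # mean queue length at each station when N = 0
    results = []
    for n in range(1, max_users + 1):
        # Response time at each station: demand inflated by customers already queued there.
        station_resp = [d * (1.0 + queue_len[i]) for i, d in enumerate(service_demands)]
        total_resp = sum(station_resp)
        throughput = n / (total_resp + think_time)   # Little's law over the whole network
        queue_len = [throughput * r for r in station_resp]
        results.append((n, throughput, total_resp))
    return results

if __name__ == "__main__":
    # Hypothetical demands (seconds) for, e.g., picking, packing, and dispatch stations.
    demands = [0.8, 1.5, 0.4]
    for n, x, r in mva(demands, think_time=2.0, max_users=20):
        print(f"N={n:2d}  throughput={x:.3f}/s  response={r:.2f}s")
```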
General Purpose Graphics Processing Units (GPGPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors (faults), often caused by high-energy particle strikes, that can significantly affect application output quality. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative to better understand the reliability of such systems. In this talk, I will present a study of the system conditions that trigger GPU soft errors, using six months of trace data collected from a large-scale, operational HPC system at Oak Ridge National Lab. Workload characteristics, certain GPU cards, temperature and power consumption could be indicative of GPU faults, but it is non-trivial to exploit them for error prediction. Motivated by these observations and challenges, I will show how machine-learning-based error prediction models can capture the hidden interactions among system and workload properties. The above findings beg the question: how can one better understand the resilience of applications if faults are bound to happen? To this end, I will illustrate the challenges of comprehensive fault injection in GPGPU applications and outline a novel fault injection solution that captures the error resilience profile of GPGPU applications.
E. Smirni. "Practical Reliability Analysis of GPGPUs in the Wild: From Systems to Applications." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310291
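As a hedged illustration of the kind of machine-learning-based error prediction the talk describes, the sketch below trains a random-forest classifier on synthetic node telemetry (temperature, power, utilization). The feature set and data are invented for illustration and are not the ORNL trace or the talk's actual model.

```python
# Illustrative only: predict GPU soft errors from node telemetry with a random
# forest. The features and synthetic data stand in for real trace data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 5000
# Hypothetical features: temperature (C), board power (W), GPU utilization (%).
X = np.column_stack([
    rng.normal(60, 8, n),
    rng.normal(180, 30, n),
    rng.uniform(0, 100, n),
])
# Synthetic label: error probability rises with temperature and power.
p = 1.0 / (1.0 + np.exp(-(0.08 * (X[:, 0] - 70) + 0.01 * (X[:, 1] - 200))))
y = rng.random(n) < p

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te), digits=3))
```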
Andreas Burger, H. Koziolek, Julius Rückert, Marie Platenius-Mohr, G. Stomberg
The OPC UA communication architecture is currently becoming an integral part of industrial automation systems, which control complex production processes such as electric power generation or paper production. With a recently released extension for pub/sub communication, OPC UA can now also support fast cyclic control applications, but the bottlenecks of OPC UA implementations and their scalability on resource-constrained industrial devices are not yet well understood. Previous OPC UA performance evaluations mainly concerned client/server round-trip times or focused on jitter, but did not explore resource bottlenecks or create predictive performance models. We have carried out extensive performance measurements with OPC UA client/server and pub/sub communication and created a CPU utilization prediction model, based on linear regression, that can be used to size hardware environments. We found that the server CPU is the main bottleneck for OPC UA pub/sub communication, but allows a throughput of up to 40,000 signals per second on a Raspberry Pi Zero. We also found that client/server session management overhead can severely impact performance if more than 20 clients access a single server.
Andreas Burger, H. Koziolek, Julius Rückert, Marie Platenius-Mohr, G. Stomberg. "Bottleneck Identification and Performance Modeling of OPC UA Communication Models." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309670
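A hedged sketch of the kind of linear-regression sizing model the paper builds: server CPU utilization predicted from the pub/sub signal rate. The measurement points and the resulting coefficients are invented placeholders, not the paper's fitted values.

```python
# Illustrative linear-regression model of server CPU utilization vs. pub/sub
# signal rate. The sample measurements below are placeholders, not paper data.
import numpy as np

# Hypothetical (signals/s, measured CPU utilization %) pairs from load tests.
signals_per_s = np.array([5000, 10000, 20000, 30000, 40000])
cpu_util_pct  = np.array([14.0, 26.0, 51.0, 74.0, 97.0])

# Fit util = a * rate + b by least squares.
a, b = np.polyfit(signals_per_s, cpu_util_pct, deg=1)
print(f"util ~= {a:.5f} * signals_per_s + {b:.2f}")

# Sizing question: what signal rate saturates the CPU (100% utilization)?
max_rate = (100.0 - b) / a
print(f"predicted saturation at ~{max_rate:.0f} signals/s")
```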
J. V. Kistowski, Johann Pais, T. Wahl, K. Lange, Hansfried Block, John Beckett, Samuel Kounev
General Purpose Graphics Processing Units (GPGPUs) are becoming more and more common in current servers and data centers, which in turn consume a significant amount of electrical power. Measuring and benchmarking this power consumption is important, as it helps with the optimization and selection of these servers. However, benchmarking and comparing the energy efficiency of GPGPU workloads is challenging, as standardized workloads are rare and standardized power and efficiency measurement methods and metrics do not exist. In addition, not all GPGPU systems run at maximum load all the time. Systems that are utilized in transactional, request-driven workloads, for example, can run at lower utilization levels. Existing benchmarks for GPGPU systems primarily consider performance and are intended only to run at maximum load. They do not measure performance or energy efficiency at other loads. In turn, server energy-efficiency benchmarks that consider multiple load levels do not address GPGPUs. This paper introduces a measurement methodology for servers with GPGPU accelerators that considers multiple load levels for transactional workloads. The methodology also addresses verifiability of results in order to achieve comparability of different device solutions. We analyze our methodology on three different systems with solutions from two different accelerator vendors. We investigate the efficacy of different methods of load-level scaling and our methodology's reproducibility. We show that the methodology is able to produce consistent and reproducible results, with a maximum coefficient of variation of 1.4% in power consumption.
J. V. Kistowski, Johann Pais, T. Wahl, K. Lange, Hansfried Block, John Beckett, Samuel Kounev. "Measuring the Energy Efficiency of Transactional Loads on GPGPU." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309667
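A minimal sketch, with made-up numbers, of the two quantities central to such a multi-load-level methodology: per-load-level energy efficiency (work per joule, i.e. throughput divided by average power) and the coefficient of variation used to check run-to-run reproducibility of power readings.

```python
# Illustrative computation of energy efficiency at multiple load levels and of
# the coefficient of variation (CoV) across repeated power measurements.
# All numbers are placeholders, not SPEC or paper results.
import statistics

# Hypothetical (load level, throughput in ops/s, average power in W).
levels = [
    ("100%", 4000.0, 950.0),
    ("75%",  3000.0, 760.0),
    ("50%",  2000.0, 590.0),
    ("25%",  1000.0, 430.0),
]
for name, ops, watts in levels:
    print(f"{name:>4} load: {ops / watts:.2f} ops/J")   # ops/s per W equals ops per joule

# Reproducibility: CoV of average power across repeated runs of the same load level.
power_runs = [948.0, 955.0, 942.0, 960.0]               # hypothetical repeated runs
cov = statistics.stdev(power_runs) / statistics.mean(power_runs)
print(f"power CoV = {cov * 100:.2f}%")
```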
J. V. Kistowski, Johannes Grohmann, Norbert Schmitt, Samuel Kounev
Data center providers and server operators try to reduce the power consumption of their servers. Finding an energy-efficient server for a specific target application is a first step in this regard. Estimating the power consumption of an application on an unavailable server is difficult, as nameplate power values are generally overestimations. Offline power models are able to predict the consumption accurately, but are usually intended for system design, requiring very specific and detailed knowledge about the system under consideration. In this paper, we introduce an offline power prediction method that uses the results of standard power rating tools. The method predicts the power consumption of a specific application for multiple load levels on a target server that is otherwise unavailable for testing. We evaluate our approach by predicting the power consumption of three applications on different physical servers. Our method is able to achieve an average prediction error of 9.49% for three workloads running on real-world, physical servers.
J. V. Kistowski, Johannes Grohmann, Norbert Schmitt, Samuel Kounev. "Predicting Server Power Consumption from Standard Rating Results." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310298
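A hedged sketch of the basic idea behind such a prediction: take power readings at the standard rating load levels, interpolate the target application's operating points against that curve, and score the prediction with a mean absolute percentage error (the same kind of metric as the reported 9.49% average error). The rating curve and "measured" values below are invented for illustration and are not the paper's method or data.

```python
# Illustrative only: predict application power on a target server by
# interpolating a standard power-rating curve (power vs. load level),
# then score predictions with mean absolute percentage error (MAPE).
import numpy as np

# Hypothetical rating results for the target server: load level -> watts.
rating_load  = np.array([0.0, 0.25, 0.50, 0.75, 1.00])
rating_power = np.array([120., 210., 300., 380., 460.])

# Hypothetical application profile: utilization it drives at three operating
# points, plus the power actually measured there (used only for evaluation).
app_util       = np.array([0.30, 0.55, 0.80])
measured_power = np.array([240., 330., 400.])

predicted = np.interp(app_util, rating_load, rating_power)
mape = np.mean(np.abs(predicted - measured_power) / measured_power) * 100
for u, p, m in zip(app_util, predicted, measured_power):
    print(f"util={u:.0%}  predicted={p:.0f} W  measured={m:.0f} W")
print(f"MAPE = {mape:.2f}%")
```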
Big data processing frameworks have received attention because of the importance of high-performance computation. They are expected to quickly process a huge amount of data in memory, with a simple programming model, in a cluster. Apache Spark is becoming one of the most popular frameworks. Several studies have analyzed Spark programs and optimized their performance. Recent versions of Spark generate optimized Java code from a Spark program, but few research works have analyzed and improved such generated code to achieve better performance. Here, two types of problems were identified by inspecting the generated code, namely, access to column-oriented storage and access to primitive-type arrays. The resulting performance issues in the generated code were analyzed, and optimizations that eliminate the inefficient code were devised to solve them. The proposed optimizations were then implemented for Spark. Experimental results with the optimizations on a cluster of five Intel machines indicated performance improvements of up to 1.4x for TPC-H queries and up to 1.4x for machine-learning programs. These optimizations have since been integrated into the release version of Apache Spark 2.3.
K. Ishizaki. "Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310300
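The paper studies the Java code Spark generates for queries over column-oriented storage and primitive-type arrays. As a small, hedged illustration of where such code paths are exercised (not of the paper's optimizations themselves), the PySpark snippet below runs an aggregation over a Parquet file and prints the physical plan, whose WholeStageCodegen stages correspond to the generated Java code. The file path and column names are hypothetical.

```python
# Illustrative PySpark job touching the code paths the paper studies:
# whole-stage code generation over column-oriented (Parquet) input.
# The path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("codegen-demo").getOrCreate()

# Reading Parquet exercises the generated code's column-oriented storage access.
lineitem = spark.read.parquet("/data/tpch/lineitem.parquet")

agg = (lineitem
       .where(F.col("l_shipdate") <= "1998-09-02")
       .groupBy("l_returnflag", "l_linestatus")
       .agg(F.sum("l_quantity").alias("sum_qty"),
            F.avg("l_extendedprice").alias("avg_price")))

# The physical plan marks stages compiled by whole-stage code generation;
# the paper's optimizations target the Java code emitted for such stages.
agg.explain(True)
agg.show()
```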
Rajesh Tadakamadla, Mikulás Patocka, Toshimitsu Kani, Scott J. Norton
Businesses today need systems that provide faster access to critical and frequently used data. Digitization has led to a rapid explosion of this business data, and thereby an increase in the database footprint. In-memory computing is one possible solution to meet the performance needs of such large databases, but the rate of data growth far exceeds the amount of memory that can hold the data. The computer industry is striving to remain on the cutting edge of technologies that accelerate performance, guard against data loss, and minimize downtime. The evolution towards a memory-centric architecture is driving the development of newer memory technologies such as Persistent Memory (also known as Storage Class Memory or Non-Volatile Memory [1]) as an answer to these pressing needs. In this paper, we present use cases of storage class memory (or persistent memory) as a write-back cache to accelerate commit-sensitive online transaction processing (OLTP) database workloads. We provide an overview of Persistent Memory, a new technology that offers the current generation of high-performance solutions a low-latency storage option that is byte-addressable. We also introduce the Linux kernel's new feature "DM-WriteCache", a write-back cache implemented on top of persistent memory solutions. Finally, we present data from our tests that demonstrate how adopting this technology can enable existing OLTP applications to scale their performance.
Rajesh Tadakamadla, Mikulás Patocka, Toshimitsu Kani, Scott J. Norton. "Accelerating Database Workloads with DM-WriteCache and Persistent Memory." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309669
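For orientation, here is a hedged sketch of how one might layer the device-mapper writecache target over a persistent-memory device, following the Linux kernel's device-mapper writecache documentation rather than the paper's exact setup. The device paths and mapping name are hypothetical, and the table syntax should be verified against the documentation for your kernel version; the command requires root.

```python
# Hedged sketch: assemble and run a dmsetup command that layers the Linux
# "writecache" device-mapper target over a persistent-memory device, per the
# kernel's device-mapper writecache documentation. Device paths are
# hypothetical; verify the table syntax for your kernel version. Run as root.
import subprocess

ORIGIN = "/dev/sdb"        # hypothetical backing device holding the database
PMEM   = "/dev/pmem0"      # hypothetical persistent-memory (DAX) device
NAME   = "db-writecache"   # hypothetical name for the mapped device

def sectors(dev: str) -> str:
    """Size of the device in 512-byte sectors, via blockdev."""
    return subprocess.run(["blockdev", "--getsz", dev],
                          check=True, capture_output=True, text=True).stdout.strip()

# writecache table: <start> <length> writecache <p|s> <origin> <cache> <blocksize> <#opt args>
table = f"0 {sectors(ORIGIN)} writecache p {ORIGIN} {PMEM} 4096 0"
subprocess.run(["dmsetup", "create", NAME, "--table", table], check=True)
print(f"created /dev/mapper/{NAME}; commit-heavy writes land in persistent memory first")
```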
Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I. Chung, Jinjun Xiong, Wen-mei W. Hwu
Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute-intensity tasks. This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. No longer is a malloc followed by memcpy the only or dominant modality of data transfer; application developers are faced with additional options such as unified memory and zero-copy memory. Data transfer performance on these systems is now impacted by many factors, including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement. This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls of ad-hoc measurements by controlling CPU caches and clock frequencies and, where possible, by avoiding measurement of synchronization costs imposed by the measurement methodology. This paper also presents an evaluation of Comm|Scope on systems featuring the POWER and x86 CPU architectures and PCIe 3, NVLink 1, and NVLink 2 interconnects. These systems are chosen as representative configurations of current high-performance GPU platforms. Comm|Scope measurements can serve to update insights about the relative performance of data transfer methods on current systems. This work also reports insights into how high-level system design choices affect the performance of these data transfers, and how developers can optimize applications on these systems.
Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I. Chung, Jinjun Xiong, Wen-mei W. Hwu. "Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310299
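To give a feel for the kind of measurement the paper systematizes, the sketch below times pageable versus pinned host-to-device copies with CUDA events. It is not Comm|Scope and applies none of its cache or frequency controls; PyTorch is used here only as a convenient CUDA wrapper, and the buffer size and repetition count are arbitrary.

```python
# Illustrative only (not Comm|Scope): time pageable vs. pinned host-to-device
# copies with CUDA events, one of the transfer modalities the paper measures.
# Uses PyTorch as a CUDA wrapper; requires a CUDA-capable GPU.
import torch

assert torch.cuda.is_available()
N = 256 * 1024 * 1024  # 256 MiB buffer
reps = 10

def h2d_bandwidth(pinned: bool) -> float:
    src = torch.empty(N, dtype=torch.uint8, pin_memory=pinned)
    dst = torch.empty(N, dtype=torch.uint8, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(reps):
        dst.copy_(src, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0     # elapsed_time is in milliseconds
    return reps * N / seconds / 1e9                # GB/s

print(f"pageable H2D: {h2d_bandwidth(False):.1f} GB/s")
print(f"pinned   H2D: {h2d_bandwidth(True):.1f} GB/s")
```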
In recent years, Mobile Cloud Computing (MCC) has been proposed as a solution to enhance the capabilities of user equipment (UE), such as smartphones, tablets and laptops. However, offloading to the conventional Cloud introduces significant execution delays that are inconvenient for near real-time applications. Mobile Edge Computing (MEC) has been proposed as a solution to this problem. MEC brings computational and storage resources closer to the UE, enabling near real-time applications to be offloaded from the UE while meeting strict latency requirements. However, it is very difficult for Edge providers to determine how many Edge nodes are required to provide MEC services in order to guarantee a high QoS and to maximize their profit. In this paper, we investigate the static provisioning of Edge nodes in an area representing a cellular network, in order to guarantee the required QoS to the user without affecting the provider's profits. First, we design a model for MEC offloading considering user satisfaction and the provider's costs. Then, we design a simulation framework based on this model. Finally, we design a multi-objective algorithm to identify a deployment solution that is a trade-off between user satisfaction and provider profit. Results show that our algorithm can guarantee a user satisfaction above 80%, with a profit for the provider of up to 4 times their cost.
Vincenzo De Maio, I. Brandić. "Multi-Objective Mobile Edge Provisioning in Small Cell Clouds." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310301
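A hedged sketch of the multi-objective flavor of the problem: enumerate candidate numbers of Edge nodes, score each on user satisfaction and provider profit, and keep the Pareto-optimal trade-offs. The satisfaction and profit models below are invented placeholders, not the paper's models or algorithm.

```python
# Illustrative Pareto-front search over how many edge nodes to provision.
# The satisfaction and profit models below are invented placeholders.
import math

def satisfaction(nodes: int) -> float:
    """Fraction of offloaded requests meeting their latency bound (toy model)."""
    return 1.0 - math.exp(-0.25 * nodes)

def profit(nodes: int, revenue_per_unit=900.0, cost_per_node=220.0) -> float:
    """Provider profit: revenue from served demand minus provisioning cost (toy model)."""
    served = min(1.0, nodes / 12.0)            # demand saturates at 12 nodes
    return served * 12 * revenue_per_unit - nodes * cost_per_node

candidates = [(n, satisfaction(n), profit(n)) for n in range(1, 31)]

def dominated(a, b):
    """True if candidate b is at least as good as a on both objectives and better on one."""
    return b[1] >= a[1] and b[2] >= a[2] and (b[1] > a[1] or b[2] > a[2])

pareto = [a for a in candidates if not any(dominated(a, b) for b in candidates if b is not a)]
for n, sat, prof in pareto:
    print(f"{n:2d} nodes: satisfaction={sat:.2%}, profit={prof:,.0f}")
```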
D. Talreja, K. Lahiri, Subramaniam Kalambur, Prakash S. Raghavendra
NoSQL databases are commonly used today in cloud deployments due to their ability to "scale out" and effectively use distributed computing resources in a data center. At the same time, cloud servers are also witnessing rapid growth in CPU core counts, memory bandwidth, and memory capacity. Hence, apart from scaling out effectively, it is important to consider how such workloads "scale up" within a single system, so that they can make the best use of available resources. In this paper, we describe our experiences studying the performance scaling characteristics of Cassandra, a popular open-source, column-oriented database, on a single high-thread-count, dual-socket server. We demonstrate that, using commonly used benchmarking practices, Cassandra does not scale well on such systems. Next, we show how, by taking into account specific knowledge of the underlying topology of the server architecture, we can achieve substantial improvements in performance scalability. We report on how, during the course of our work, we uncovered an area for performance improvement in the official open-source implementation of the Java platform with respect to NUMA awareness. We show how optimizing this resulted in a 27% throughput gain for Cassandra under the studied configurations. As a result of these optimizations, using standard workload generators, we obtained up to 1.44x and 2.55x improvements in Cassandra throughput over baseline single- and dual-socket performance measurements, respectively. On wider testing across a variety of workloads, we achieved excellent performance scaling, averaging 98% efficiency within a socket and 90% efficiency at the system level.
D. Talreja, K. Lahiri, Subramaniam Kalambur, Prakash S. Raghavendra. "Performance Scaling of Cassandra on High-Thread Count Servers." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309668
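A small sketch of the efficiency arithmetic behind figures like the 98% within-socket and 90% system-level results above: scaling efficiency is the measured speedup divided by the speedup perfect linear scaling would give. The throughput numbers below are placeholders chosen only to reproduce those two percentages, not the paper's measurements.

```python
# Illustrative scaling-efficiency calculation for scale-up measurements.
# Throughput numbers are placeholders, not the paper's measurements.
def scaling_efficiency(base_throughput: float, base_units: int,
                       scaled_throughput: float, scaled_units: int) -> float:
    """Measured speedup divided by the ideal (linear) speedup."""
    ideal = scaled_units / base_units
    actual = scaled_throughput / base_throughput
    return actual / ideal

# Hypothetical ops/s at quarter-socket, full-socket, and dual-socket configurations.
quarter_socket, one_socket, two_sockets = 30_000.0, 117_600.0, 211_700.0

within_socket = scaling_efficiency(quarter_socket, 1, one_socket, 4)
system_level  = scaling_efficiency(one_socket, 1, two_sockets, 2)
print(f"within-socket efficiency: {within_socket:.0%}")
print(f"system-level efficiency:  {system_level:.0%}")
```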