Automation and the use of robotic components within business processes are in vogue across the retail and manufacturing industries. However, a structured way of analyzing the performance improvements provided by automation in complex workflows is still at a nascent stage. In this paper, we consider common Industry 4.0 automation workflow resource patterns and model them within a hybrid queuing network. The queuing stations are replaced by scale-up, scale-out and hybrid-scale automation patterns to examine improvements in end-to-end process performance. We exhaustively simulate throughput, response time, utilization and operating costs at higher concurrencies using Mean Value Analysis (MVA) algorithms. The queues are analyzed for cases with multiple classes, batch/transactional processing and load-dependent service demands. These solutions are demonstrated on an exemplar Industry 4.0 warehouse automation workflow. A structured process for automation workflow performance analysis will prove valuable across industrial deployments.
A. Kattepur. "Towards Structured Performance Analysis of Industry 4.0 Workflow Automation Resources." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309671
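Because the paper above rests on Mean Value Analysis, a minimal sketch of the exact single-class MVA recursion for a closed queueing network may help; the station count, service demands, and think time below are hypothetical placeholders, not values from the paper.

```python
# Exact Mean Value Analysis (MVA) for a closed, single-class queueing network.
# Service demands and think time below are illustrative placeholders only.

def mva(service_demands, think_time, max_users):
    """Return (N, throughput, response time) for N = 1..max_users."""
    k = len(service_demands)
    queue_len = [0.0] * k          # mean queue length at each station when N = 0
    results = []
    for n in range(1, max_users + 1):
        # Response time at each station: demand inflated by customers already queued there.
        station_resp = [d * (1.0 + queue_len[i]) for i, d in enumerate(service_demands)]
        total_resp = sum(station_resp)
        throughput = n / (total_resp + think_time)   # Little's law over the whole network
        queue_len = [throughput * r for r in station_resp]
        results.append((n, throughput, total_resp))
    return results

if __name__ == "__main__":
    # Hypothetical demands (seconds) for, e.g., picking, packing, and dispatch stations.
    demands = [0.8, 1.5, 0.4]
    for n, x, r in mva(demands, think_time=2.0, max_users=20):
        print(f"N={n:2d}  throughput={x:.3f}/s  response={r:.2f}s")
```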
General Purpose Graphics Processing Units (GPGPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors (faults), often caused by high-energy particle strikes, that can significantly affect application output quality. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative to better understand the reliability of such systems. In this talk, I will present a study of the system conditions that trigger GPU soft errors, using six months of trace data collected from a large-scale, operational HPC system at Oak Ridge National Lab. Workload characteristics, certain GPU cards, temperature and power consumption could be indicative of GPU faults, but it is non-trivial to exploit them for error prediction. Motivated by these observations and challenges, I will show how machine-learning-based error prediction models can capture the hidden interactions among system and workload properties. The above findings beg the question: how can one better understand the resilience of applications if faults are bound to happen? To this end, I will illustrate the challenges of comprehensive fault injection in GPGPU applications and outline a novel fault injection solution that captures the error resilience profile of GPGPU applications.
E. Smirni. "Practical Reliability Analysis of GPGPUs in the Wild: From Systems to Applications." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310291
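As a hedged illustration of the kind of machine-learning-based error prediction the talk describes, the sketch below trains a random-forest classifier on synthetic node telemetry (temperature, power, utilization). The feature set and data are invented for illustration and are not the ORNL trace or the talk's actual model.

```python
# Illustrative only: predict GPU soft errors from node telemetry with a random
# forest. The features and synthetic data stand in for real trace data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 5000
# Hypothetical features: temperature (C), board power (W), GPU utilization (%).
X = np.column_stack([
    rng.normal(60, 8, n),
    rng.normal(180, 30, n),
    rng.uniform(0, 100, n),
])
# Synthetic label: error probability rises with temperature and power.
p = 1.0 / (1.0 + np.exp(-(0.08 * (X[:, 0] - 70) + 0.01 * (X[:, 1] - 200))))
y = rng.random(n) < p

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te), digits=3))
```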
Andreas Burger, H. Koziolek, Julius Rückert, Marie Platenius-Mohr, G. Stomberg
The OPC UA communication architecture is currently becoming an integral part of industrial automation systems, which control complex production processes such as electric power generation or paper production. With a recently released extension for pub/sub communication, OPC UA can now also support fast cyclic control applications, but the bottlenecks of OPC UA implementations and their scalability on resource-constrained industrial devices are not yet well understood. Previous OPC UA performance evaluations mainly concerned client/server round-trip times or focused on jitter, but did not explore resource bottlenecks or create predictive performance models. We have carried out extensive performance measurements with OPC UA client/server and pub/sub communication and created a CPU utilization prediction model, based on linear regression, that can be used to size hardware environments. We found that the server CPU is the main bottleneck for OPC UA pub/sub communication, but allows a throughput of up to 40,000 signals per second on a Raspberry Pi Zero. We also found that client/server session management overhead can severely impact performance if more than 20 clients access a single server.
Andreas Burger, H. Koziolek, Julius Rückert, Marie Platenius-Mohr, G. Stomberg. "Bottleneck Identification and Performance Modeling of OPC UA Communication Models." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309670
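A hedged sketch of the kind of linear-regression sizing model the paper builds: server CPU utilization predicted from the pub/sub signal rate. The measurement points and the resulting coefficients are invented placeholders, not the paper's fitted values.

```python
# Illustrative linear-regression model of server CPU utilization vs. pub/sub
# signal rate. The sample measurements below are placeholders, not paper data.
import numpy as np

# Hypothetical (signals/s, measured CPU utilization %) pairs from load tests.
signals_per_s = np.array([5000, 10000, 20000, 30000, 40000])
cpu_util_pct  = np.array([14.0, 26.0, 51.0, 74.0, 97.0])

# Fit util = a * rate + b by least squares.
a, b = np.polyfit(signals_per_s, cpu_util_pct, deg=1)
print(f"util ~= {a:.5f} * signals_per_s + {b:.2f}")

# Sizing question: what signal rate saturates the CPU (100% utilization)?
max_rate = (100.0 - b) / a
print(f"predicted saturation at ~{max_rate:.0f} signals/s")
```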
J. V. Kistowski, Johann Pais, T. Wahl, K. Lange, Hansfried Block, John Beckett, Samuel Kounev
General Purpose Graphics Processing Units (GPGPUs) are becoming more and more common in current servers and data centers, which in turn consume a significant amount of electrical power. Measuring and benchmarking this power consumption is important, as it helps with the optimization and selection of these servers. However, benchmarking and comparing the energy efficiency of GPGPU workloads is challenging, as standardized workloads are rare and standardized power and efficiency measurement methods and metrics do not exist. In addition, not all GPGPU systems run at maximum load all the time. Systems that are utilized in transactional, request-driven workloads, for example, can run at lower utilization levels. Existing benchmarks for GPGPU systems primarily consider performance and are intended only to run at maximum load. They do not measure performance or energy efficiency at other loads. In turn, server energy-efficiency benchmarks that consider multiple load levels do not address GPGPUs. This paper introduces a measurement methodology for servers with GPGPU accelerators that considers multiple load levels for transactional workloads. The methodology also addresses verifiability of results in order to achieve comparability of different device solutions. We analyze our methodology on three different systems with solutions from two different accelerator vendors. We investigate the efficacy of different methods of load-level scaling and our methodology's reproducibility. We show that the methodology is able to produce consistent and reproducible results, with a maximum coefficient of variation of 1.4% in power consumption.
J. V. Kistowski, Johann Pais, T. Wahl, K. Lange, Hansfried Block, John Beckett, Samuel Kounev. "Measuring the Energy Efficiency of Transactional Loads on GPGPU." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309667
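A minimal sketch, with made-up numbers, of the two quantities central to such a multi-load-level methodology: per-load-level energy efficiency (work per joule, i.e. throughput divided by average power) and the coefficient of variation used to check run-to-run reproducibility of power readings.

```python
# Illustrative computation of energy efficiency at multiple load levels and of
# the coefficient of variation (CoV) across repeated power measurements.
# All numbers are placeholders, not SPEC or paper results.
import statistics

# Hypothetical (load level, throughput in ops/s, average power in W).
levels = [
    ("100%", 4000.0, 950.0),
    ("75%",  3000.0, 760.0),
    ("50%",  2000.0, 590.0),
    ("25%",  1000.0, 430.0),
]
for name, ops, watts in levels:
    print(f"{name:>4} load: {ops / watts:.2f} ops/J")   # ops/s per W equals ops per joule

# Reproducibility: CoV of average power across repeated runs of the same load level.
power_runs = [948.0, 955.0, 942.0, 960.0]               # hypothetical repeated runs
cov = statistics.stdev(power_runs) / statistics.mean(power_runs)
print(f"power CoV = {cov * 100:.2f}%")
```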
J. V. Kistowski, Johannes Grohmann, Norbert Schmitt, Samuel Kounev
Data center providers and server operators try to reduce the power consumption of their servers. Finding an energy-efficient server for a specific target application is a first step in this regard. Estimating the power consumption of an application on an unavailable server is difficult, as nameplate power values are generally overestimations. Offline power models are able to predict the consumption accurately, but are usually intended for system design, requiring very specific and detailed knowledge about the system under consideration. In this paper, we introduce an offline power prediction method that uses the results of standard power rating tools. The method predicts the power consumption of a specific application for multiple load levels on a target server that is otherwise unavailable for testing. We evaluate our approach by predicting the power consumption of three applications on different physical servers. Our method is able to achieve an average prediction error of 9.49% for three workloads running on real-world, physical servers.
J. V. Kistowski, Johannes Grohmann, Norbert Schmitt, Samuel Kounev. "Predicting Server Power Consumption from Standard Rating Results." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310298
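A hedged sketch of the basic idea behind such a prediction: take power readings at the standard rating load levels, interpolate the target application's operating points against that curve, and score the prediction with a mean absolute percentage error (the same kind of metric as the reported 9.49% average error). The rating curve and "measured" values below are invented for illustration and are not the paper's method or data.

```python
# Illustrative only: predict application power on a target server by
# interpolating a standard power-rating curve (power vs. load level),
# then score predictions with mean absolute percentage error (MAPE).
import numpy as np

# Hypothetical rating results for the target server: load level -> watts.
rating_load  = np.array([0.0, 0.25, 0.50, 0.75, 1.00])
rating_power = np.array([120., 210., 300., 380., 460.])

# Hypothetical application profile: utilization it drives at three operating
# points, plus the power actually measured there (used only for evaluation).
app_util       = np.array([0.30, 0.55, 0.80])
measured_power = np.array([240., 330., 400.])

predicted = np.interp(app_util, rating_load, rating_power)
mape = np.mean(np.abs(predicted - measured_power) / measured_power) * 100
for u, p, m in zip(app_util, predicted, measured_power):
    print(f"util={u:.0%}  predicted={p:.0f} W  measured={m:.0f} W")
print(f"MAPE = {mape:.2f}%")
```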
Big data processing frameworks have received attention because of the importance of high-performance computation. They are expected to quickly process a huge amount of data in memory, with a simple programming model, in a cluster. Apache Spark is becoming one of the most popular frameworks. Several studies have analyzed Spark programs and optimized their performance. Recent versions of Spark generate optimized Java code from a Spark program, but few research works have analyzed and improved such generated code to achieve better performance. Here, two types of problems were identified by inspecting the generated code, namely, access to column-oriented storage and access to primitive-type arrays. The resulting performance issues in the generated code were analyzed, and optimizations that eliminate the inefficient code were devised to solve them. The proposed optimizations were then implemented for Spark. Experimental results with the optimizations on a cluster of five Intel machines indicated performance improvements of up to 1.4x for TPC-H queries and up to 1.4x for machine-learning programs. These optimizations have since been integrated into the release version of Apache Spark 2.3.
K. Ishizaki. "Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310300
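The paper studies the Java code Spark generates for queries over column-oriented storage and primitive-type arrays. As a small, hedged illustration of where such code paths are exercised (not of the paper's optimizations themselves), the PySpark snippet below runs an aggregation over a Parquet file and prints the physical plan, whose WholeStageCodegen stages correspond to the generated Java code. The file path and column names are hypothetical.

```python
# Illustrative PySpark job touching the code paths the paper studies:
# whole-stage code generation over column-oriented (Parquet) input.
# The path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("codegen-demo").getOrCreate()

# Reading Parquet exercises the generated code's column-oriented storage access.
lineitem = spark.read.parquet("/data/tpch/lineitem.parquet")

agg = (lineitem
       .where(F.col("l_shipdate") <= "1998-09-02")
       .groupBy("l_returnflag", "l_linestatus")
       .agg(F.sum("l_quantity").alias("sum_qty"),
            F.avg("l_extendedprice").alias("avg_price")))

# The physical plan marks stages compiled by whole-stage code generation;
# the paper's optimizations target the Java code emitted for such stages.
agg.explain(True)
agg.show()
```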
Rajesh Tadakamadla, Mikulás Patocka, Toshimitsu Kani, Scott J. Norton
Businesses today need systems that provide faster access to critical and frequently used data. Digitization has led to a rapid explosion of this business data, and thereby an increase in the database footprint. In-memory computing is one possible solution to meet the performance needs of such large databases, but the rate of data growth far exceeds the amount of memory that can hold the data. The computer industry is striving to remain on the cutting edge of technologies that accelerate performance, guard against data loss, and minimize downtime. The evolution towards a memory-centric architecture is driving the development of newer memory technologies such as Persistent Memory (also known as Storage Class Memory or Non-Volatile Memory [1]) as an answer to these pressing needs. In this paper, we present use cases of storage class memory (or persistent memory) as a write-back cache to accelerate commit-sensitive online transaction processing (OLTP) database workloads. We provide an overview of Persistent Memory, a new technology that offers the current generation of high-performance solutions a low-latency storage option that is byte-addressable. We also introduce the Linux kernel's new feature "DM-WriteCache", a write-back cache implemented on top of persistent memory solutions. Finally, we present data from our tests that demonstrate how adopting this technology can enable existing OLTP applications to scale their performance.
Rajesh Tadakamadla, Mikulás Patocka, Toshimitsu Kani, Scott J. Norton. "Accelerating Database Workloads with DM-WriteCache and Persistent Memory." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309669
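For orientation, here is a hedged sketch of how one might layer the device-mapper writecache target over a persistent-memory device, following the Linux kernel's device-mapper writecache documentation rather than the paper's exact setup. The device paths and mapping name are hypothetical, and the table syntax should be verified against the documentation for your kernel version; the command requires root.

```python
# Hedged sketch: assemble and run a dmsetup command that layers the Linux
# "writecache" device-mapper target over a persistent-memory device, per the
# kernel's device-mapper writecache documentation. Device paths are
# hypothetical; verify the table syntax for your kernel version. Run as root.
import subprocess

ORIGIN = "/dev/sdb"        # hypothetical backing device holding the database
PMEM   = "/dev/pmem0"      # hypothetical persistent-memory (DAX) device
NAME   = "db-writecache"   # hypothetical name for the mapped device

def sectors(dev: str) -> str:
    """Size of the device in 512-byte sectors, via blockdev."""
    return subprocess.run(["blockdev", "--getsz", dev],
                          check=True, capture_output=True, text=True).stdout.strip()

# writecache table: <start> <length> writecache <p|s> <origin> <cache> <blocksize> <#opt args>
table = f"0 {sectors(ORIGIN)} writecache p {ORIGIN} {PMEM} 4096 0"
subprocess.run(["dmsetup", "create", NAME, "--table", table], check=True)
print(f"created /dev/mapper/{NAME}; commit-heavy writes land in persistent memory first")
```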
Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I. Chung, Jinjun Xiong, Wen-mei W. Hwu
Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute-intensity tasks. This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. No longer is a malloc followed by memcpy the only or dominant modality of data transfer; application developers are faced with additional options such as unified memory and zero-copy memory. Data transfer performance on these systems is now impacted by many factors, including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement. This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls of ad-hoc measurements by controlling CPU caches and clock frequencies and, where possible, by avoiding measurement of synchronization costs imposed by the measurement methodology. This paper also presents an evaluation of Comm|Scope on systems featuring the POWER and x86 CPU architectures and PCIe 3, NVLink 1, and NVLink 2 interconnects. These systems are chosen as representative configurations of current high-performance GPU platforms. Comm|Scope measurements can serve to update insights about the relative performance of data transfer methods on current systems. This work also reports insights into how high-level system design choices affect the performance of these data transfers, and how developers can optimize applications on these systems.
Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I. Chung, Jinjun Xiong, Wen-mei W. Hwu. "Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310299
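To give a feel for the kind of measurement the paper systematizes, the sketch below times pageable versus pinned host-to-device copies with CUDA events. It is not Comm|Scope and applies none of its cache or frequency controls; PyTorch is used here only as a convenient CUDA wrapper, and the buffer size and repetition count are arbitrary.

```python
# Illustrative only (not Comm|Scope): time pageable vs. pinned host-to-device
# copies with CUDA events, one of the transfer modalities the paper measures.
# Uses PyTorch as a CUDA wrapper; requires a CUDA-capable GPU.
import torch

assert torch.cuda.is_available()
N = 256 * 1024 * 1024  # 256 MiB buffer
reps = 10

def h2d_bandwidth(pinned: bool) -> float:
    src = torch.empty(N, dtype=torch.uint8, pin_memory=pinned)
    dst = torch.empty(N, dtype=torch.uint8, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(reps):
        dst.copy_(src, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0     # elapsed_time is in milliseconds
    return reps * N / seconds / 1e9                # GB/s

print(f"pageable H2D: {h2d_bandwidth(False):.1f} GB/s")
print(f"pinned   H2D: {h2d_bandwidth(True):.1f} GB/s")
```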
In recent years, Mobile Cloud Computing (MCC) has been proposed as a solution to enhance the capabilities of user equipment (UE), such as smartphones, tablets and laptops. However, offloading to the conventional Cloud introduces significant execution delays that are inconvenient for near real-time applications. Mobile Edge Computing (MEC) has been proposed as a solution to this problem. MEC brings computational and storage resources closer to the UE, enabling near real-time applications to be offloaded from the UE while meeting strict latency requirements. However, it is very difficult for Edge providers to determine how many Edge nodes are required to provide MEC services in order to guarantee a high QoS and to maximize their profit. In this paper, we investigate the static provisioning of Edge nodes in an area representing a cellular network, in order to guarantee the required QoS to the user without affecting the provider's profits. First, we design a model for MEC offloading considering user satisfaction and the provider's costs. Then, we design a simulation framework based on this model. Finally, we design a multi-objective algorithm to identify a deployment solution that is a trade-off between user satisfaction and provider profit. Results show that our algorithm can guarantee a user satisfaction above 80%, with a profit for the provider of up to 4 times their cost.
Vincenzo De Maio, I. Brandić. "Multi-Objective Mobile Edge Provisioning in Small Cell Clouds." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3310301
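A hedged sketch of the multi-objective flavor of the problem: enumerate candidate numbers of Edge nodes, score each on user satisfaction and provider profit, and keep the Pareto-optimal trade-offs. The satisfaction and profit models below are invented placeholders, not the paper's models or algorithm.

```python
# Illustrative Pareto-front search over how many edge nodes to provision.
# The satisfaction and profit models below are invented placeholders.
import math

def satisfaction(nodes: int) -> float:
    """Fraction of offloaded requests meeting their latency bound (toy model)."""
    return 1.0 - math.exp(-0.25 * nodes)

def profit(nodes: int, revenue_per_unit=900.0, cost_per_node=220.0) -> float:
    """Provider profit: revenue from served demand minus provisioning cost (toy model)."""
    served = min(1.0, nodes / 12.0)            # demand saturates at 12 nodes
    return served * 12 * revenue_per_unit - nodes * cost_per_node

candidates = [(n, satisfaction(n), profit(n)) for n in range(1, 31)]

def dominated(a, b):
    """True if candidate b is at least as good as a on both objectives and better on one."""
    return b[1] >= a[1] and b[2] >= a[2] and (b[1] > a[1] or b[2] > a[2])

pareto = [a for a in candidates if not any(dominated(a, b) for b in candidates if b is not a)]
for n, sat, prof in pareto:
    print(f"{n:2d} nodes: satisfaction={sat:.2%}, profit={prof:,.0f}")
```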
D. Talreja, K. Lahiri, Subramaniam Kalambur, Prakash S. Raghavendra
NoSQL databases are commonly used today in cloud deployments due to their ability to "scale out" and effectively use distributed computing resources in a data center. At the same time, cloud servers are also witnessing rapid growth in CPU core counts, memory bandwidth, and memory capacity. Hence, apart from scaling out effectively, it is important to consider how such workloads "scale up" within a single system, so that they can make the best use of available resources. In this paper, we describe our experiences studying the performance scaling characteristics of Cassandra, a popular open-source, column-oriented database, on a single high-thread-count, dual-socket server. We demonstrate that, using commonly used benchmarking practices, Cassandra does not scale well on such systems. Next, we show how, by taking into account specific knowledge of the underlying topology of the server architecture, we can achieve substantial improvements in performance scalability. We report on how, during the course of our work, we uncovered an area for performance improvement in the official open-source implementation of the Java platform with respect to NUMA awareness. We show how optimizing this resulted in a 27% throughput gain for Cassandra under the studied configurations. As a result of these optimizations, using standard workload generators, we obtained up to 1.44x and 2.55x improvements in Cassandra throughput over baseline single- and dual-socket performance measurements, respectively. On wider testing across a variety of workloads, we achieved excellent performance scaling, averaging 98% efficiency within a socket and 90% efficiency at the system level.
D. Talreja, K. Lahiri, Subramaniam Kalambur, Prakash S. Raghavendra. "Performance Scaling of Cassandra on High-Thread Count Servers." In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE 2019). DOI: 10.1145/3297663.3309668
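A small sketch of the efficiency arithmetic behind figures like the 98% within-socket and 90% system-level results above: scaling efficiency is the measured speedup divided by the speedup perfect linear scaling would give. The throughput numbers below are placeholders chosen only to reproduce those two percentages, not the paper's measurements.

```python
# Illustrative scaling-efficiency calculation for scale-up measurements.
# Throughput numbers are placeholders, not the paper's measurements.
def scaling_efficiency(base_throughput: float, base_units: int,
                       scaled_throughput: float, scaled_units: int) -> float:
    """Measured speedup divided by the ideal (linear) speedup."""
    ideal = scaled_units / base_units
    actual = scaled_throughput / base_throughput
    return actual / ideal

# Hypothetical ops/s at quarter-socket, full-socket, and dual-socket configurations.
quarter_socket, one_socket, two_sockets = 30_000.0, 117_600.0, 211_700.0

within_socket = scaling_efficiency(quarter_socket, 1, one_socket, 4)
system_level  = scaling_efficiency(one_socket, 1, two_sockets, 2)
print(f"within-socket efficiency: {within_socket:.0%}")
print(f"system-level efficiency:  {system_level:.0%}")
```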