Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering: Latest Articles (Page 2)

A Cloud Performance Analytics Framework to Support Online Performance Diagnosis and Monitoring Tools

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3309675

Amit Banerjee, Abhishek Srivastava

Traditionally, performance analysis, debugging, triaging, troubleshooting, and optimization are left in the hands of performance experts. The main rationale is that performance engineering is considered a specialized domain expertise, and is therefore left to trained experts. However, this approach requires human effort behind every performance escalation. This is no longer future-proof in enterprise environments for the following reasons: (i) enterprise customers now expect much quicker performance troubleshooting, particularly on cloud platforms with Software as a Service (SaaS) offerings where billing is subscription based; (ii) as products grow more distributed and complex, the number of performance metrics required to troubleshoot a performance problem explodes, making human intervention and analysis very time consuming; and (iii) our past experience shows that while many customers run into similar performance issues, the human effort to troubleshoot each of these issues in a different infrastructural environment is non-trivial. We believe that data analytics platforms that can quickly mine performance data and point out potential bottlenecks offer a good way for non-domain experts to debug and solve a performance issue. In this work, we showcase a cloud-based performance data analytics framework that can be leveraged to build tools which analyze and root-cause performance issues in enterprise systems. We describe the architecture of this framework, which consists of: (i) a cloud service (which we term a plugin), (ii) supporting libraries that may be used to interact with this plugin from end systems such as computer servers or appliance Virtual Machines (VMs), and (iii) a solution to monitor and analyze the results delivered by the plugin. We demonstrate how this platform can be used to develop different performance analysis and debugging tools. We provide one example of a tool that we have built on top of this framework and released: VMware Virtual SAN (vSAN) performance diagnostics. We specifically discuss how collecting performance data in the cloud from over a thousand deployments, and then analyzing it to detect performance issues, helped us write rules that can easily detect similar performance issues. Finally, we discuss a framework for monitoring the performance of the rules and improving them.

Citations: 2

Performance Modeling for Cloud Microservice Applications

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3310309

Anshul Jindal, Vladimir Podolskiy, M. Gerndt

Microservices enable fine-grained control over the cloud applications that they constitute and have thus become widely used in industry. Each microservice implements its own functionality and communicates with other microservices through language- and platform-agnostic APIs. The resource usage of microservices varies depending on the implemented functionality and the workload. A continuously increasing load or a sudden load spike may lead to a violation of a service level objective (SLO). To characterize the behavior of a microservice application that is acceptable to the user, we define the MicroService Capacity (MSC) as the maximal rate of requests that can be served without violating the SLO. The paper addresses the challenge of identifying the MSC individually for each microservice. Finding the individual capacities of microservices ensures flexibility in capacity planning for an application. This challenge is addressed by sandboxing a microservice and building its performance model. This approach was implemented in a tool called Terminus. The tool estimates the capacity of a microservice on different deployment configurations by conducting a limited set of load tests and then fitting an appropriate regression model to the acquired performance data. The evaluation of the microservice performance models on microservices of four different applications showed relatively accurate predictions, with a mean absolute percentage error (MAPE) of less than 10%. The results of the proposed performance modeling for individual microservices are deemed a major input for performance modeling of the microservice application as a whole.
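As a rough illustration of the capacity estimation described in this abstract, the sketch below fits a regression model to load-test measurements and reads off the highest request rate whose predicted response time still meets the SLO. The measurements, the quadratic model family, the 50 ms SLO, and the helper names `estimate_msc` and `mape` are all assumptions made for this example; none of them is taken from Terminus itself.

```python
import numpy as np

# Hypothetical load-test results for one sandboxed microservice:
# request rate (req/s) and observed mean response time (ms).
rates = np.array([50, 100, 150, 200, 250, 300], dtype=float)
latency_ms = np.array([20, 24, 31, 42, 58, 80], dtype=float)

SLO_MS = 50.0  # assumed SLO on mean response time

# Fit latency = f(rate) with a quadratic polynomial (assumed model family).
model = np.poly1d(np.polyfit(rates, latency_ms, deg=2))


def estimate_msc(model, slo_ms, lo, hi, step=1.0):
    """Highest rate in [lo, hi] reached before the predicted latency exceeds the SLO."""
    msc = None
    for rate in np.arange(lo, hi + step, step):
        if model(rate) > slo_ms:
            break
        msc = float(rate)
    return msc


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)


msc = estimate_msc(model, SLO_MS, rates.min(), rates.max())
error = mape(latency_ms, model(rates))
print(f"estimated MSC: {msc} req/s, model MAPE: {error:.1f}%")
```

The search only covers the range of rates that was actually load-tested, since a regression model of this kind should not be trusted far outside the measured data.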

Citations: 70

Profile-based Detection of Layered Bottlenecks

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3310296

T. Inagaki, Yohei Ueda, T. Nakaike, Moriyoshi Ohara

Detecting software bottlenecks that hinder the utilization of hardware resources is a classic but complex problem because software bottlenecks have layered structures. Model-based approaches require a performance model to be given, which is impractical to maintain in today's agile development environments, while profile-based approaches do not handle the layered structures of software bottlenecks. This paper proposes a novel approach that takes the best of both worlds: it extracts a performance model from execution profiles of the target application to detect layered bottlenecks. We collect a wake-up profile of threads, which samples events in which one thread wakes up another thread, and build a thread dependency graph to detect the layered bottlenecks. We implement our approach of profile-based detection of layered bottlenecks in the Go programming language. We demonstrate that our method can detect software bottlenecks limiting the scalability and throughput of state-of-the-art middleware, such as a web application server and a permissioned blockchain network, with a small runtime overhead for profile collection.
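To make the idea of a thread dependency graph more concrete, here is a minimal sketch that builds such a graph from wake-up samples and follows the heaviest dependency edges down to a root thread. The sample data, the thread names, and the "follow the heaviest edge" heuristic are assumptions for illustration only; the paper's actual detection algorithm may differ.

```python
from collections import Counter, defaultdict

# Hypothetical wake-up samples: (waker thread, woken thread).
samples = (
    [("conn-pool", "worker-1")] * 3
    + [("conn-pool", "worker-2")] * 2
    + [("timer", "worker-2")]
    + [("db-writer", "conn-pool")] * 4
)


def build_dependency_graph(samples):
    """Map each woken thread to a count of the threads it was waiting on."""
    graph = defaultdict(Counter)
    for waker, woken in samples:
        graph[woken][waker] += 1
    return graph


def find_root(graph, thread):
    """Follow the heaviest dependency edge until a thread that waits on nobody."""
    seen = {thread}
    current = thread
    while current in graph and graph[current]:
        waker, _ = graph[current].most_common(1)[0]
        if waker in seen:  # stop on a cycle
            break
        seen.add(waker)
        current = waker
    return current


def rank_bottlenecks(graph):
    """Attribute each waiting thread's sample count to its root dependency."""
    roots = Counter()
    for thread in list(graph):
        roots[find_root(graph, thread)] += sum(graph[thread].values())
    return roots.most_common()


graph = build_dependency_graph(samples)
ranking = rank_bottlenecks(graph)
print(ranking)
```

In this toy input the workers wait on `conn-pool`, which in turn waits on `db-writer`, so the ranking points at the lowest layer rather than at the thread the workers block on directly.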

Citations: 8

Mowgli

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3310303

Daniel Seybold, M. Keppler, Daniel Gründler, Jörg Domaschka

Big Data and IoT applications require highly scalable database management systems (DBMSs), preferably operated in the cloud to ensure scalability at the resource level as well. As the number of existing distributed DBMSs is extensive, selecting and operating a distributed DBMS in the cloud is a challenging task. While DBMS benchmarking is a supportive approach, existing frameworks do not cope with the runtime constraints of distributed DBMSs and the volatility of cloud environments. Hence, DBMS evaluation frameworks need to consider DBMS runtime and cloud resource constraints to enable portable and reproducible results. In this paper we present Mowgli, a novel evaluation framework that enables the evaluation of non-functional DBMS features in correlation with DBMS runtime and cloud resource constraints. Mowgli fully automates the execution of cloud- and DBMS-agnostic evaluation scenarios, including DBMS cluster adaptations. The evaluation of Mowgli is based on two IoT-driven scenarios, comprising the DBMSs Apache Cassandra and Couchbase, nine DBMS runtime configurations, and two cloud providers with two different storage backends. Mowgli automates the execution of the resulting 102 evaluation scenarios, verifying its support for portable and reproducible DBMS evaluations. The results provide extensive insights into DBMS scalability and the impact of different cloud resources. The significance of the results is validated by correlating them with existing DBMS evaluation results.

Citations: 4

Profiling and Tracing Support for Java Applications

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3309677

A. Nisbet, N. Nobre, G. Riley, M. Luján

We demonstrate the feasibility of undertaking performance evaluations for JVMs using: (1) a hybrid JVM/OS tool, such as async-profiler; (2) OS-centric profiling and tracing tools based on Linux perf; and (3) the Extended Berkeley Packet Filter (eBPF) tracing framework. For eBPF, we demonstrate the rationale behind the standard offwaketime tool for analysing the causes of blocking latencies, and present our own eBPF-based tool, bcc-java, which relates changes in microarchitecture performance counter values to the execution of individual JVM and application threads at low overhead. The relative execution time overheads of the performance tools are illustrated for the DaCapo-bach-9.12 benchmarks with OpenJDK9 on an Intel Xeon E5-2690 running Ubuntu 16.04. Whereas sampling-based tools can cause up to 25% slowdown at a 4 kHz sampling frequency, our tool bcc-java has a geometric mean overhead of less than 5%. Only for the avrora benchmark does bcc-java have a significant overhead (37%), due to an unusually high number of futex system calls. Finally, we discuss the recommended approaches for specific performance use-case scenarios.
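For readers unfamiliar with how a geometric mean overhead is derived from per-benchmark run times, a small sketch follows. The benchmark names are borrowed from the DaCapo suite, but the run times are invented placeholders and do not reproduce the paper's measurements.

```python
from statistics import geometric_mean

# Hypothetical run times in seconds (not the paper's data).
baseline = {"avrora": 10.0, "h2": 20.0, "lusearch": 5.0}
profiled = {"avrora": 13.7, "h2": 20.2, "lusearch": 5.05}

# Per-benchmark slowdown ratio: profiled time divided by baseline time.
ratios = [profiled[name] / baseline[name] for name in baseline]

# The geometric mean is the usual way to average ratios across benchmarks.
overhead_pct = (geometric_mean(ratios) - 1.0) * 100.0
print(f"geometric mean overhead: {overhead_pct:.1f}%")
```

A single outlier (here the invented 37% slowdown for `avrora`) still pulls the geometric mean up noticeably, which is why reporting per-benchmark outliers alongside the mean is useful.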

Citations: 6

Cachematic - Automatic Invalidation in Application-Level Caching Systems

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3309666

V. Holmqvist, Jonathan Nilsfors, P. Leitner

Caching is a common method for improving the performance of modern web applications. Due to the varying architecture of web applications, and the lack of a standardized approach to cache management, ad-hoc solutions are common. These solutions tend to be hard to maintain as a code base grows, and are a common source of bugs. We present Cachematic, a general-purpose application-level caching system with an automatic cache management strategy. Cachematic provides a simple programming model, allowing developers to explicitly denote a function as cacheable. The result of a cacheable function is transparently cached without the developer having to worry about cache management. We present algorithms that automatically handle cache management, the cache dependency tree, and cache invalidation. Our experiments showed that deploying Cachematic decreased response time for read requests compared to a manual cache management strategy, in a representative case study conducted in collaboration with Bison, a US-based business intelligence company. We also found that, compared to the manual strategy, the cache hit rate increased by a factor of around 1.64. However, we observe a significant increase in response time for write requests. We conclude that automatic cache management as implemented in Cachematic is attractive for read-dominant use cases, but the substantial write overhead in our current proof-of-concept implementation represents a challenge.
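The programming model described here (mark a function as cacheable, let the system track dependencies and invalidate automatically) can be sketched in a few lines. The class `DependencyCache`, its `cacheable` decorator, and its `invalidate` method are hypothetical names invented for this illustration and are not Cachematic's API; the sketch also assumes hashable positional arguments only.

```python
import functools


class DependencyCache:
    """Toy cache that records which cached calls depend on which others."""

    def __init__(self):
        self._values = {}      # key -> cached result
        self._dependents = {}  # key -> keys of calls that used this result
        self._stack = []       # keys of cacheable calls currently executing

    def cacheable(self, func):
        @functools.wraps(func)
        def wrapper(*args):
            key = (func.__name__, args)
            if self._stack:
                # The enclosing cacheable call depends on this one.
                self._dependents.setdefault(key, set()).add(self._stack[-1])
            if key in self._values:
                return self._values[key]
            self._stack.append(key)
            try:
                result = func(*args)
            finally:
                self._stack.pop()
            self._values[key] = result
            return result
        return wrapper

    def invalidate(self, func_name, *args):
        """Drop one entry and, transitively, every entry that depended on it."""
        pending = [(func_name, args)]
        while pending:
            key = pending.pop()
            self._values.pop(key, None)
            pending.extend(self._dependents.pop(key, ()))


cache = DependencyCache()
prices = {"apple": 2, "pear": 3}
price_calls = []


@cache.cacheable
def price(item):
    price_calls.append(item)
    return prices[item]


@cache.cacheable
def basket_total(items):
    return sum(price(item) for item in items)


first_total = basket_total(("apple", "pear"))
prices["apple"] = 4                # a write to the underlying data
cache.invalidate("price", "apple")  # also drops the dependent basket total
second_total = basket_total(("apple", "pear"))
print(first_total, second_total)
```

The extra bookkeeping on every write (walking the dependents) hints at why an automatic scheme can speed up reads while making writes more expensive, which is the trade-off the abstract reports.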

Citations: 6

Analyzing Data Structure Growth Over Time to Facilitate Memory Leak Detection

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3310297

Markus Weninger, Elias Gander, H. Mössenböck

Memory leaks are a major threat in modern software systems. They occur if objects are unintentionally kept alive longer than necessary and are often indicated by continuously growing data structures. While there are various state-of-the-art memory monitoring tools, most of them share two critical shortcomings: (1) they have no knowledge about the monitored application's data structures, and (2) they support no or only rudimentary analysis of the application's data structures over time. This paper presents novel techniques to tackle both of these drawbacks. It introduces a domain-specific language (DSL) that allows users to describe arbitrary data structures, as well as an algorithm to detect instances of these data structures in reconstructed heaps. In addition, we propose techniques and metrics to analyze and measure the evolution of data structure instances over time. This allows us to identify those instances that are most likely involved in a memory leak. These concepts have been integrated into AntTracks, a trace-based memory monitoring tool. We apply our approach to detect memory leaks in several real-world applications, showing its applicability and feasibility.
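As a simplified picture of what "measuring the evolution of data structure instances over time" can look like, the sketch below ranks data structures by how steadily and how much they grow across heap snapshots. The snapshot sizes, the instance names, and the two metrics (`growing_fraction`, `absolute_growth`) are assumptions for this example and are not the metrics defined in the paper or in AntTracks.

```python
# Hypothetical element counts per data structure instance, one value per heap snapshot.
history = {
    "sessionMap": [100, 180, 260, 350, 430],
    "requestQueue": [50, 400, 30, 380, 40],
    "configList": [12, 12, 12, 12, 12],
}


def growth_metrics(sizes):
    """Absolute growth and the fraction of snapshot intervals that showed growth."""
    deltas = [later - earlier for earlier, later in zip(sizes, sizes[1:])]
    return {
        "absolute_growth": sizes[-1] - sizes[0],
        "growing_fraction": sum(d > 0 for d in deltas) / len(deltas),
    }


def rank_suspects(history):
    """Order instances by steady growth first, then by total growth."""
    scored = {name: growth_metrics(sizes) for name, sizes in history.items()}
    return sorted(
        scored,
        key=lambda name: (scored[name]["growing_fraction"],
                          scored[name]["absolute_growth"]),
        reverse=True,
    )


suspects = rank_suspects(history)
print(suspects)
```

Ranking by steadiness before size keeps a bursty but bounded structure such as the invented `requestQueue` from outranking one that grows in every interval.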

Citations: 11

Performance Modelling of an Anonymous and Failure Resilient Fair-Exchange E-Commerce Protocol

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3310310

Ohud Almutairi, N. Thomas

This paper explores a type of non-repudiation protocol, called an anonymous and failure resilient fair-exchange e-commerce protocol, which guarantees a fair exchange between two parties in an e-commerce environment. Models are formulated using the PEPA formalism to investigate the performance overheads introduced by the security properties and behaviour of the protocol. The PEPA Eclipse Plug-in is used to support the creation of the PEPA models for the security protocol and the automatic calculation of the performance measures identified for the protocol models.

Citations: 2

Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3310305

Sitao Huang, Li-Wen Chang, I. E. Hajj, Simon Garcia De Gonzalo, Juan Gómez-Luna, S. R. Chalamalasetti, Mohamed El-Hadedy, D. Milojicic, O. Mutlu, Deming Chen, Wen-mei W. Hwu

Heterogeneous CPU-FPGA systems are evolving towards tighter integration between CPUs and FPGAs for improved performance and energy efficiency. At the same time, programmability is also improving with High Level Synthesis tools (e.g., OpenCL Software Development Kits), which allow programmers to express their designs with high-level programming languages, and avoid time-consuming and error-prone register-transfer level (RTL) programming. In the traditional loosely-coupled accelerator mode, FPGAs work as offload accelerators, where an entire kernel runs on the FPGA while the CPU thread waits for the result. However, tighter integration of the CPUs and the FPGAs enables the possibility of fine-grained collaborative execution, i.e., having both devices working concurrently on the same workload. Such collaborative execution makes better use of the overall system resources by employing both CPU threads and FPGA concurrency, thereby achieving higher performance. In this paper, we explore the potential of collaborative execution between CPUs and FPGAs using OpenCL High Level Synthesis. First, we compare various collaborative techniques (namely, data partitioning and task partitioning), and evaluate the tradeoffs between them. We observe that choosing the most suitable partitioning strategy can improve performance by up to 2x. Second, we study the impact of a common optimization technique, kernel duplication, in a collaborative CPU-FPGA context. We show that the general trend is that kernel duplication improves performance until the memory bandwidth saturates. Third, we provide new insights that application developers can use when designing CPU-FPGA collaborative applications to choose between different partitioning strategies. 
We find that different partitioning strategies pose different tradeoffs (e.g., task partitioning enables more kernel duplication, while data partitioning has lower communication overhead and better load balance), but they generally outperform execution on conventional CPU-FPGA systems where no collaborative execution strategies are used. Therefore, we advocate even more integration in future heterogeneous CPU-FPGA systems (e.g., OpenCL 2.0 features, such as fine-grained shared virtual memory).
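
The two partitioning strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' OpenCL implementation: thread pools stand in for the CPU and the FPGA, and `stage_a`, `stage_b`, `cpu_share`, and `chunk_size` are hypothetical names introduced only for illustration. In data partitioning, both devices run the whole pipeline on disjoint slices of the input. In task partitioning, each device owns one stage and chunks flow from one to the other.

```python
from concurrent.futures import ThreadPoolExecutor


def stage_a(values):
    """First task of a toy two-stage pipeline (hypothetical): square each element."""
    return [v * v for v in values]


def stage_b(values):
    """Second task of the toy pipeline (hypothetical): add one to each element."""
    return [v + 1 for v in values]


def data_partitioned(data, cpu_share=0.5):
    """Both 'devices' run the full pipeline, each on its own slice of the data."""
    split = int(len(data) * cpu_share)

    def run(chunk):
        return stage_b(stage_a(chunk))

    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu = pool.submit(run, data[:split])   # stands in for the CPU threads
        fpga = pool.submit(run, data[split:])  # stands in for the FPGA kernel
        return cpu.result() + fpga.result()


def task_partitioned(data, chunk_size=4):
    """Each 'device' owns one stage; chunks are handed from stage_a to stage_b."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=1) as dev_a, \
            ThreadPoolExecutor(max_workers=1) as dev_b:
        # Device A processes every chunk with stage_a, in order.
        a_futures = [dev_a.submit(stage_a, c) for c in chunks]
        # Device B starts stage_b on a chunk once device A has finished it,
        # while device A is already working on later chunks.
        b_futures = [dev_b.submit(stage_b, f.result()) for f in a_futures]
        return [v for f in b_futures for v in f.result()]


if __name__ == "__main__":
    data = list(range(10))
    print(data_partitioned(data))
    print(task_partitioned(data))
```

Both functions compute the same result; they differ only in how work is divided. The hand-off between stages in `task_partitioned` is where the communication overhead mentioned in the abstract would arise on real hardware.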

Citations: 25

A Study of Core Utilization and Residency in Heterogeneous Smart Phone Architectures

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pub Date : 2019-04-04 DOI: 10.1145/3297663.3310304

Joseph Whitehouse, Qinzhe Wu, Shuang Song, E. John, A. Gerstlauer, L. John

In recent years, the smart phone platform has seen a rise in the number of cores and the use of heterogeneous clusters, as in the Qualcomm Snapdragon, Apple A10, and Samsung Exynos processors. This paper attempts to understand the characteristics of mobile workloads, with measurements on heterogeneous multicore phone platforms with big and little cores. It answers questions such as the following: (i) Do smart phones need multiple cores of different types (e.g., big or little)? (ii) Is it more energy-efficient to operate with more cores (for less time) or with fewer cores, even if it might take longer? (iii) What are the best frequencies at which to operate the cores, considering energy efficiency? (iv) Do mobile applications need out-of-order speculative execution cores with complex branch prediction? (v) Is IPC a good performance indicator for early design tradeoff evaluation while working on mobile processor design? Using Geekbench, more than 3 dozen Android applications, and the Workload Automation tool from ARM, we measure core utilization, frequency residencies, and energy efficiency characteristics on two leading-edge smart phones. Many characteristics of smartphone platforms are presented, and architectural implications of the observations as well as design considerations for future mobile processors are discussed. A key insight is that multiple big and complex cores are beneficial from both a performance and an energy point of view in certain scenarios. It is seen that 4 big cores are utilized during the launch and update phases of applications. Similarly, reboot using all 4 cores at maximum performance provides latency advantages. However, it consumes higher power and energy, and reboot with 2 cores was seen to be more energy efficient than reboot with 1 or 4 cores. 
Furthermore, inaccurate branch prediction is seen to result in up to 40% mis-speculated instructions in many applications, suggesting that it is important to improve the accuracy of branch predictors in mobile processors. While absolute IPCs are observed to be a poor predictor of benchmark scores, relative IPCs are useful for estimating the impact of microarchitectural changes on benchmark scores.
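
Two relationships behind these observations can be sketched in a few lines of Python: energy is average power multiplied by time, and execution time is instruction count divided by (IPC × clock frequency). All numbers below are hypothetical values chosen only to show the arithmetic; they are not measurements from the paper, and the function names are assumptions made for this sketch.

```python
def energy_joules(avg_power_watts, seconds):
    """Energy consumed at a given average power over a given duration."""
    return avg_power_watts * seconds


def exec_time_seconds(instructions, ipc, freq_hz):
    """Execution time from instruction count, IPC, and clock frequency."""
    return instructions / (ipc * freq_hz)


# HYPOTHETICAL reboot configurations: cores -> (average power in W, latency in s).
# More cores finish sooner but draw more power, so energy need not be minimal
# at either extreme.
configs = {1: (1.0, 40.0), 2: (1.5, 22.0), 4: (3.0, 15.0)}
energy = {cores: energy_joules(p, t) for cores, (p, t) in configs.items()}
most_efficient = min(energy, key=energy.get)

# With instruction count and frequency held fixed, the speedup from a
# microarchitectural change equals the ratio of the IPCs (the relative IPC).
old_time = exec_time_seconds(1e9, ipc=1.2, freq_hz=2e9)
new_time = exec_time_seconds(1e9, ipc=1.5, freq_hz=2e9)
speedup = old_time / new_time

if __name__ == "__main__":
    print(energy, most_efficient, speedup)
```

The second half shows why a relative IPC can track the effect of a design change even when an absolute IPC says little about a benchmark score: the instruction count and frequency cancel out of the ratio.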

Citations: 2