
Latest publications from the 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Characterization and analysis of a web search benchmark
Zacharias Hadjilambrou, Marios Kleanthous, Yiannakis Sazeides
Web search as a service is an impressive engineering achievement. Web search runs on thousands of servers that search an index of billions of web pages. The search results must be relevant to the user's query and reach the user in a fraction of a second. A web search service must guarantee the same QoS at all times, even at peak incoming traffic load. Not unjustifiably, the web search service has attracted a lot of research attention. Despite the high research interest web search has gained, much remains unknown about the functionality and architecture of web search benchmarks. Much research has been done using commercial web search engines, like Bing or Google, but many details of these search engines are, of course, not disclosed to the public. We take an academically accepted web search benchmark and perform a thorough characterization and analysis of it. We shed light into the architecture and functionality of the benchmark. We also investigate some prominent web search research issues. In particular, we study how intra-server index partitioning affects response time and throughput, and we explore the potential of low-power servers for web search. Our results show that intra-server partitioning can reduce tail latencies and that, given enough partitioning, low-power servers can provide the same response times as conventional high-performance servers.
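The tail-latency effect studied in this abstract can be illustrated with a toy simulation (an illustrative sketch, not the authors' benchmark or workload): splitting each query's work across index partitions searched in parallel shrinks the slowest path and pulls in the 99th percentile.

```python
import random

def p99(samples):
    """99th percentile by nearest rank on sorted samples."""
    s = sorted(samples)
    return s[int(0.99 * (len(s) - 1))]

def simulate(num_queries, partitions, seed=0):
    """Toy model: a query's work is split evenly across `partitions`
    shards searched in parallel; each shard's time has random skew,
    and the query completes when the slowest shard finishes."""
    rng = random.Random(seed)
    latencies = []
    for _ in range(num_queries):
        total_work = rng.lognormvariate(0, 1)  # heavy-tailed query cost
        shard_times = [(total_work / partitions) * rng.uniform(0.8, 1.2)
                       for _ in range(partitions)]
        latencies.append(max(shard_times))
    return latencies

tail_1 = p99(simulate(10000, partitions=1))
tail_8 = p99(simulate(10000, partitions=8))
print(tail_8 < tail_1)  # more partitions -> lower tail in this toy model
```

The model ignores fan-out and merge costs, which is exactly why partitioning has diminishing returns in practice.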
DOI: 10.1109/ISPASS.2015.7095818 (published 2015-03-29)
Citations: 5
Where does the time go? characterizing tail latency in memcached
G. Blake, A. Saidi
To function correctly, Online Data-Intensive (OLDI) services require low and consistent service times. Maintaining predictable service times entails meeting 99th-or-higher-percentile latency targets across hundreds to thousands of servers in the data center. However, to maintain the 99th-percentile targets, servers are routinely run well below full utilization. The main difficulty in optimizing a server to run closer to peak utilization while maintaining predictable 99th-percentile response latencies is identifying and mitigating the causes of a request missing the target service time. In practice this analysis is challenging, as requests and responses overlap their execution with one another and traverse multiple layers of software, user/kernel protection boundaries, and the hardware/software divide. Traditional profiling methods that record the time spent in each function usually yield few clues as to where a bottleneck may be, because each of the many software layers consumes only a small fraction of the time. In this work we analyze the end-to-end sources of latency in a Memcached server, from the wire through the kernel into the application and back again. To do so, we develop a tool that uses the Linux SystemTap infrastructure to measure latency throughout the many software layers that make up the complete request and response path for Memcached. While memory copies and the Linux networking stack are often suggested as major contributors to latency, we find that the main cause of missing response latency guarantees is the formation of standing queues and the application's inability to detect and remedy this situation.
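The paper's central finding, that standing queues rather than per-layer overheads dominate missed latency targets, can be sketched with a textbook single-server queue (an illustrative model, not the authors' SystemTap tool): as utilization approaches saturation, queueing delay inflates the 99th percentile far faster than the mean.

```python
import random

def simulate_queue(utilization, n=20000, seed=1):
    """Lindley recursion for a single FIFO server: the wait of request
    n+1 is max(0, W_n + S_n - A_{n+1}).  Returns response times
    (queueing wait + service) for exponential arrivals and services."""
    rng = random.Random(seed)
    service_rate = 1.0
    arrival_rate = utilization * service_rate
    wait = 0.0
    responses = []
    for _ in range(n):
        service = rng.expovariate(service_rate)
        responses.append(wait + service)
        interarrival = rng.expovariate(arrival_rate)
        wait = max(0.0, wait + service - interarrival)
    return responses

def p99(samples):
    return sorted(samples)[int(0.99 * (len(samples) - 1))]

low = p99(simulate_queue(0.5))    # moderate load: little standing queue
high = p99(simulate_queue(0.95))  # near saturation: standing queues form
print(round(low, 1), round(high, 1))
```

This is one reason OLDI servers are run well below full utilization: the tail, not the mean, sets the provisioning point.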
DOI: 10.1109/ISPASS.2015.7095781 (published 2015-03-29)
Citations: 10
DELPHI: a framework for RTL-based architecture design evaluation using DSENT models
Michael Papamichael, Cagla Cakir, Chen Sun, C. Chen, J. Hoe, K. Mai, L. Peh, V. Stojanović
Computer architects are increasingly interested in evaluating their ideas at the register-transfer level (RTL) to gain more precise insights into the key characteristics (frequency, area, power) of a micro/architectural design proposal. However, the RTL synthesis process is notoriously tedious, slow, and error-prone, and is often outside the expertise of a typical computer architect, as it requires familiarity with complex CAD flows, hard-to-get tools, and standard cell libraries. The effort is further multiplied when targeting multiple technology nodes and standard cell variants to study technology dependence. This paper presents DELPHI, a flexible, open framework that leverages the DSENT modeling engine for faster, easier, and more efficient characterization of RTL hardware designs. DELPHI first synthesizes a Verilog or VHDL RTL design (using either the industry-standard Synopsys Design Compiler tool or a combination of open-source tools) to an intermediate structural netlist. It then processes the resulting synthesized netlist to generate a technology-independent DSENT design model. This model can then be used within a modified version of the DSENT flow to estimate hardware performance characteristics such as frequency, area, and power, one to two orders of magnitude faster than full RTL synthesis, across a variety of DSENT technology models (e.g., 65nm Bulk, 32nm SOI, 11nm Tri-Gate). In our evaluation using 26 RTL design examples, DELPHI and DSENT were consistently able to closely track and capture the design trends of conventional RTL synthesis results without the associated delay and complexity.
We are releasing the full DELPHI framework (including a fully open-source flow) at http://www.ece.cmu.edu/CALCM/delphi/.
DOI: 10.1109/ISPASS.2015.7095780 (published 2015-03-29)
Citations: 6
Mosaic: cross-platform user-interaction record and replay for the fragmented android ecosystem
Matthew Halpern, Yuhao Zhu, R. Peri, V. Reddi
In contrast to traditional computing systems, such as desktops and servers, which are programmed to perform “compute-bound”, “run-to-completion” tasks, mobile applications are designed for user interactivity. Factoring user interactivity into computer system design and evaluation is important, yet poses many challenges. In particular, systematically studying interactive mobile applications across the diverse set of mobile devices available today is difficult due to the mobile device fragmentation problem. At the time of writing, there are 18,796 distinct Android mobile devices on the market, and that number will only continue to increase. Differences in screen sizes, resolutions, and operating systems impose different interactivity requirements, making it difficult to study these systems uniformly. We present Mosaic, a cross-platform, timing-accurate record-and-replay tool for Android-based mobile devices. Mosaic overcomes device fragmentation through a novel virtual screen abstraction. User interactions are translated from a physical device into a platform-agnostic intermediate representation before translation to a target system. The intermediate representation is human-readable, which allows Mosaic users to modify previously recorded traces or even synthesize their own user-interactive sessions from scratch. We demonstrate that Mosaic allows user interaction traces to be recorded on emulators, smartphones, tablets, and development boards and replayed on other devices. Using Mosaic we were able to replay 45 different Google Play applications across multiple devices, and we also show that we can perform cross-platform performance comparisons between two different processors under identical user interactions.
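The virtual screen idea can be sketched in a few lines (an illustrative model; Mosaic's actual intermediate representation is not shown here): record touch coordinates as fractions of the source screen, then rescale them to the target device.

```python
def record_event(x, y, width, height):
    """Store a touch in a device-independent form: fractions of the
    recording device's screen (the 'virtual screen')."""
    return (x / width, y / height)

def replay_event(event, width, height):
    """Map a virtual-screen event back to pixels on the target device."""
    fx, fy = event
    return (round(fx * width), round(fy * height))

# Recorded on a 1080x1920 phone, replayed on a 1536x2048 tablet:
# a tap at the screen center lands at the center of the target too.
tap = record_event(540, 960, 1080, 1920)
print(replay_event(tap, 1536, 2048))  # (768, 1024)
```

Timing-accurate replay additionally requires preserving the inter-event delays recorded in the trace, which this sketch omits.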
DOI: 10.1109/ISPASS.2015.7095807 (published 2015-03-29)
Citations: 78
Factors affecting scalability of multithreaded Java applications on manycore systems
Junjie Qian, Du Li, W. Srisa-an, Hong Jiang, S. Seth
Modern Java applications employ multithreading to improve performance by harnessing the execution parallelism available in today's multicore processors. However, as the numbers of threads and processing cores are scaled up, many applications do not achieve the desired level of performance improvement. In this paper, we explore two factors, lock contention and garbage collection performance, that can affect the scalability of Java applications. Our initial results reveal two new observations. First, applications that are highly scalable may experience more instances of lock contention than applications that are less scalable. Second, efficient multithreading can make garbage collection less effective, thereby negatively impacting garbage collection performance.
DOI: 10.1109/ISPASS.2015.7095800 (published 2015-03-29)
Citations: 7
A modeling framework for reuse distance-based estimation of cache performance
Xiaoyue Pan, B. Jonsson
We develop an analytical modeling framework for efficient prediction of cache miss ratios based on reuse distance distributions. The only input needed for our predictions is the reuse distance distribution of a program execution: previous work has shown that it can be obtained with very small overhead by sampling native executions. This contrasts with previous approaches that base predictions on stack distance distributions, whose collection needs significantly larger overhead or additional hardware support. The predictions are based on a uniform modeling framework that can be specialized for a variety of cache replacement policies, including Random, LRU, PLRU, and MRU (a.k.a. bit-PLRU), and for arbitrary values of cache size and cache associativity. We evaluate our modeling framework with the SPEC CPU 2006 benchmark suite over a set of cache configurations with varying cache size, associativity, and replacement policy. The introduced inaccuracies were generally below 1% for the model of the policy, plus around 2% when set-local reuse distances must be estimated from global reuse distance distributions. The inaccuracy introduced by sampling is significantly smaller.
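One classic instance of a reuse-distance-based cache model (in the spirit of StatCache-style models, not necessarily this paper's exact framework) predicts the miss ratio of a fully associative cache with Random replacement by solving a fixed point: each of the d references between two uses of a block misses with probability m, and each miss evicts the block with probability 1/C.

```python
def predict_miss_ratio(reuse_hist, cache_size, iters=100):
    """Fixed-point solve of  m = sum_d P(d) * (1 - (1 - m/C)^d)
    for a fully associative cache of C blocks with Random replacement.
    `reuse_hist` maps reuse distance (in references) -> probability;
    a distance of None denotes a cold miss, which always misses."""
    m = 1.0  # start pessimistic and iterate to the fixed point
    for _ in range(iters):
        m_new = 0.0
        for d, p in reuse_hist.items():
            if d is None:
                m_new += p  # cold misses
            else:
                # block survives d intervening references, each of which
                # misses w.p. m and then evicts it w.p. 1/C
                m_new += p * (1.0 - (1.0 - m / cache_size) ** d)
        m = m_new
    return m

# Toy distribution: 10% cold misses, mostly short reuses, some long.
hist = {None: 0.1, 2: 0.5, 64: 0.3, 4096: 0.1}
small = predict_miss_ratio(hist, cache_size=16)
large = predict_miss_ratio(hist, cache_size=1024)
print(small > large)  # a larger cache predicts fewer misses
```

The only input is the reuse distance histogram, which matches the abstract's point that such histograms are cheap to sample, unlike stack distances.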
DOI: 10.1109/ISPASS.2015.7095785 (published 2015-03-29)
Citations: 24
Precise computer comparisons via statistical resampling methods
Bin Li, Shaoming Chen, Lu Peng
Performance variability, stemming from non-deterministic hardware and software behaviors or from deterministic behaviors such as measurement bias, is a well-known phenomenon of computer systems that increases the difficulty of comparing computer performance metrics. Conventional methods use various measures (such as the geometric mean) to quantify the performance of different benchmarks and compare computers without considering variability, which may lead to wrong conclusions. In this paper, we propose three resampling methods for performance evaluation and comparison: a randomization test for a general performance comparison between two computers, bootstrapping confidence estimation, and an empirical distribution with a five-number summary for performance evaluation. The results show that 1) the randomization test substantially improves our chance of identifying a difference between performance comparisons when the difference is not large; 2) bootstrapping confidence estimation provides an accurate confidence interval for the performance comparison measure (e.g., the ratio of geometric means); and 3) when the difference is very small, a single test is often not enough to reveal the nature of the computer performance, and a five-number summary is needed to summarize it. We illustrate the results and conclusions through detailed Monte Carlo simulation studies and real examples. Results show that our methods are precise and robust even when two computers have very similar performance metrics.
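A generic percentile-bootstrap sketch of the second method (the paper's exact procedure may differ) resamples benchmarks with replacement and reports a confidence interval for the ratio of geometric means.

```python
import math
import random

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def bootstrap_ci_ratio(a, b, reps=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the ratio of
    geometric means of per-benchmark scores `a` and `b` (paired by
    benchmark); each replicate resamples benchmarks with replacement."""
    rng = random.Random(seed)
    n = len(a)
    ratios = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        ratios.append(geomean([a[i] for i in idx]) /
                      geomean([b[i] for i in idx]))
    ratios.sort()
    lo = ratios[int((alpha / 2) * reps)]
    hi = ratios[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

# Hypothetical data: machine A scores 15-30% higher than machine B
# on ten benchmarks (B's scores normalized to 1.0).
a = [1.18, 1.25, 1.22, 1.15, 1.30, 1.19, 1.24, 1.21, 1.17, 1.28]
b = [1.00] * 10
lo, hi = bootstrap_ci_ratio(a, b)
print(lo > 1.0)  # True: the whole interval sits above 1, so A is faster
```

If the interval straddled 1.0, a single geometric-mean comparison would be inconclusive, which is exactly the variability problem the abstract describes.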
DOI: 10.1109/ISPASS.2015.7095787 (published 2015-03-29)
Citations: 1
Pydgin: generating fast instruction set simulators from simple architecture descriptions with meta-tracing JIT compilers
Derek Lockhart, Berkin Ilbeyi, C. Batten
Instruction set simulators (ISSs) remain an essential tool for the rapid exploration and evaluation of instruction set extensions in both academia and industry. Due to their importance in both hardware and software design, modern ISSs must balance a tension between developer productivity and high-performance simulation. Productivity requirements have led to “ADL-driven” toolflows that automatically generate ISSs from high-level architectural description languages (ADLs). Meanwhile, performance requirements have prompted ISSs to incorporate increasingly complicated dynamic binary translation (DBT) techniques. Construction of frameworks capable of providing both the productivity benefits of ADL-generated simulators and the performance benefits of DBT remains a significant challenge. We introduce Pydgin, a new approach to ISS construction that addresses the multiple challenges of designing, implementing, and maintaining ADL-generated DBT-ISSs. Pydgin uses a Python-based, embedded-ADL to succinctly describe instruction behavior as directly executable “pseudocode”. These Pydgin ADL descriptions are used to automatically generate high-performance DBT-ISSs by creatively adapting an existing meta-tracing JIT compilation framework designed for general-purpose dynamic programming languages. We demonstrate the capabilities of Pydgin by implementing ISSs for two instruction sets and show that Pydgin provides concise, flexible ISA descriptions while also generating simulators with performance comparable to hand-coded DBT-ISSs.
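The "directly executable pseudocode" style of instruction description can be approximated as plain Python functions over an architectural state object (the class and field names below are illustrative, not Pydgin's actual API):

```python
class State:
    """Minimal architectural state: 32 registers and a program counter."""
    def __init__(self):
        self.rf = [0] * 32
        self.pc = 0

class Inst:
    """Decoded instruction fields (illustrative, not a real decoder)."""
    def __init__(self, rd, rs, rt):
        self.rd, self.rs, self.rt = rd, rs, rt

def execute_addu(s, inst):
    """Instruction semantics as directly executable pseudocode:
    unsigned 32-bit add, then advance the PC."""
    s.rf[inst.rd] = (s.rf[inst.rs] + s.rf[inst.rt]) & 0xFFFFFFFF
    s.pc += 4

s = State()
s.rf[1], s.rf[2] = 7, 35
execute_addu(s, Inst(rd=3, rs=1, rt=2))
print(s.rf[3], s.pc)  # 42 4
```

Because such definitions are ordinary interpreter code, a meta-tracing JIT framework can trace and specialize them, which is the mechanism Pydgin exploits to generate fast DBT-ISSs.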
DOI: 10.1109/ISPASS.2015.7095811 · Published: 2015-03-29
Citations: 13
Non-volatile memory host controller interface performance analysis in high-performance I/O systems
Amro Awad, B. Kettering, Yan Solihin
Emerging non-volatile memories (NVMs), such as Phase-Change Memory (PCM), Spin-Transfer Torque RAM (STT-RAM) and Memristor, are very promising candidates for replacing NAND-Flash Solid-State Drives (SSDs) and Hard Disk Drives (HDDs) for many reasons. First, their read/write latencies are orders of magnitude faster. Second, some emerging NVMs, such as memristors, are expected to have very high densities, which allow deploying a much higher capacity without requiring increased physical space. While the percentage of the time taken for data movement over low-speed buses, such as Peripheral Component Interconnect (PCI), is negligible for the overall read/write latency in HDDs, it could be dominant for emerging fast NVMs. Therefore, the trend has moved toward using very fast interconnect technologies, such as PCI Express (PCIe) which is hundreds of times faster than the traditional PCI. Accordingly, new host controller interfaces are used to communicate with I/O devices to exploit the parallelism and low-latency features of emerging NVMs through high-speed interconnects. In this paper, we investigate the system performance bottlenecks and overhead of using the standard state-of-the-art Non-Volatile Memory Express (NVMe), or Non-Volatile Memory Host Controller Interface (NVMHCI) Specification [1] as representative for NVM host controller interfaces.
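The abstract's claim — bus transfer time is negligible against HDD latency but can dominate for fast NVMs — is easy to make concrete with a back-of-the-envelope calculation. The latency and bandwidth figures below are ballpark assumptions for illustration, not measurements from the paper:

```python
# Rough estimate: what fraction of a 4 KiB read is spent moving data
# over the interconnect, for slow vs fast storage media?
# All numbers are illustrative ballpark figures, not measured values.

def transfer_fraction(device_latency_s, bus_bandwidth_bps, xfer_bytes=4096):
    """Fraction of total access time spent on the bus transfer."""
    bus_time = xfer_bytes / bus_bandwidth_bps
    return bus_time / (device_latency_s + bus_time)

PCI_BW  = 133e6          # classic 32-bit/33 MHz PCI, ~133 MB/s
PCIE_BW = 4e9            # PCIe 3.0 x4, roughly ~4 GB/s usable

hdd_on_pci  = transfer_fraction(5e-3, PCI_BW)    # ~5 ms seek + rotation
nvm_on_pci  = transfer_fraction(10e-6, PCI_BW)   # ~10 us PCM-class read
nvm_on_pcie = transfer_fraction(10e-6, PCIE_BW)

print(f"HDD over PCI:  {hdd_on_pci:.1%} of access time is bus transfer")
print(f"NVM over PCI:  {nvm_on_pci:.1%}")
print(f"NVM over PCIe: {nvm_on_pcie:.1%}")
```

Under these assumptions the bus accounts for well under 1% of an HDD access but roughly three quarters of a PCM-class access over legacy PCI, and moving to PCIe pushes it back under 10% — which is exactly the motivation the abstract gives for fast interconnects and NVMe-style host controller interfaces.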
DOI: 10.1109/ISPASS.2015.7095793 · Published: 2015-03-29
Citations: 18
An updated performance comparison of virtual machines and Linux containers
Wes Felter, Alexandre Ferreira, R. Rajamony, J. Rubio
Cloud computing makes extensive use of virtual machines because they permit workloads to be isolated from one another and resource usage to be somewhat controlled. In this paper, we explore the performance of traditional virtual machine (VM) deployments, and contrast them with the use of Linux containers. We use KVM as a representative hypervisor and Docker as a container manager. Our results show that containers result in equal or better performance than VMs in almost all cases. Both VMs and containers require tuning to support I/O-intensive applications. We also discuss the implications of our performance results for future cloud architectures.
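Comparisons like the one the abstract describes are typically summarized as per-benchmark slowdowns relative to bare metal, aggregated with a geometric mean. A minimal helper for that aggregation is sketched below; the runtimes are placeholder numbers for illustration, not results from the paper:

```python
import math

# Summarizing virtualization overhead: per-benchmark runtimes relative to
# native are aggregated with a geometric mean, the standard choice for
# averaging ratios. The runtime values below are placeholders.

def geomean_slowdown(native, virtualized):
    """Geometric mean of per-benchmark runtimes relative to native."""
    ratios = [v / n for n, v in zip(native, virtualized)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

native_s    = [10.0, 42.0, 7.5]    # hypothetical bare-metal runtimes (s)
container_s = [10.1, 42.5, 7.6]    # hypothetical Docker runtimes (s)
vm_s        = [10.8, 47.0, 8.9]    # hypothetical KVM runtimes (s)

print(f"container slowdown: {geomean_slowdown(native_s, container_s):.3f}x")
print(f"VM slowdown:        {geomean_slowdown(native_s, vm_s):.3f}x")
```

The geometric mean is used rather than the arithmetic mean because slowdowns are ratios: averaging a 2x slowdown and a 0.5x speedup should yield 1.0x, which only the geometric mean delivers.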
DOI: 10.1109/ISPASS.2015.7095802 · Published: 2015-03-29
Citations: 1084
Journal
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)