Web search as a service is very impressive. Web search runs on thousands of servers which perform search on an index of billions of web pages. The search results must be both relevant to the user queries and reach the user in a fraction of a second. A web search service must guarantee the same QoS at all times even at the peak incoming traffic load. Not unjustifiably the web search service has attracted a lot of research attention. Despite the high research interest web search has gained, there are still plenty unknown about the functionality and the architecture of web search benchmarks. Much research has been done using commercial web search engines, like Bing or Google, but many details of these search engines are, of course, not disclosed to the public. We take an academically accepted web search benchmark and we perform a thorough characterization and analysis of it. We shed light in to the architecture and the functionality of the benchmark. We also investigate some prominent web search research issues. In particular, we study how intra-server index partitioning affects the response time and throughput and we also explore the potential use of low power servers for web search. Our results show that intra-server partitioning can reduce tail latencies and that low power servers given enough partitioning can provide same response times as conventional high performance servers.
{"title":"Characterization and analysis of a web search benchmark","authors":"Zacharias Hadjilambrou, Marios Kleanthous, Yiannakis Sazeides","doi":"10.1109/ISPASS.2015.7095818","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095818","url":null,"abstract":"Web search as a service is very impressive. Web search runs on thousands of servers which perform search on an index of billions of web pages. The search results must be both relevant to the user queries and reach the user in a fraction of a second. A web search service must guarantee the same QoS at all times even at the peak incoming traffic load. Not unjustifiably the web search service has attracted a lot of research attention. Despite the high research interest web search has gained, there are still plenty unknown about the functionality and the architecture of web search benchmarks. Much research has been done using commercial web search engines, like Bing or Google, but many details of these search engines are, of course, not disclosed to the public. We take an academically accepted web search benchmark and we perform a thorough characterization and analysis of it. We shed light in to the architecture and the functionality of the benchmark. We also investigate some prominent web search research issues. In particular, we study how intra-server index partitioning affects the response time and throughput and we also explore the potential use of low power servers for web search. Our results show that intra-server partitioning can reduce tail latencies and that low power servers given enough partitioning can provide same response times as conventional high performance servers.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126071641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-03-29DOI: 10.1109/ISPASS.2015.7095781
G. Blake, A. Saidi
To function correctly Online, Data-Intensive (OLDI) services require low and consistent service times. Maintaining predictable service times entails requiring 99th or higher percentile latency targets across hundreds to thousands of servers in the data-center. However, to maintain the 99th percentile targets servers are routinely run well below full utilization. The main difficulty in optimizing a server to run closer to peak utilization and maintain predictable 99th percentile response latencies is identifying and mitigating the causes of a request missing the target service time. In practice this analysis is challenging as requests and responses overlap their execution with respect to one another and traverse multiple layers of software, user/kernel protection boundaries, and the hardware/software divide. Traditional profiling methods that record the time being spent in each function usually yield few clues as to where a bottleneck may be present due to the many layers of software each consuming only a small fraction of time each. In this work we analyze the end-to-end sources of latency in a Memcached server from the wire through the kernel into the application and back again. To do so, we develop a tool that utilizes the Linux SystemTap infrastructure to measure latency throughout the many software layers that make up the complete request and response path for Memcached. While memory copies and the Linux networking stack are often suggested as major contributors to latency, we find that the main cause of missing response latency guarantees is the formation of standing queues and the application's inability to detect and remedy this situation.
{"title":"Where does the time go? characterizing tail latency in memcached","authors":"G. Blake, A. Saidi","doi":"10.1109/ISPASS.2015.7095781","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095781","url":null,"abstract":"To function correctly Online, Data-Intensive (OLDI) services require low and consistent service times. Maintaining predictable service times entails requiring 99th or higher percentile latency targets across hundreds to thousands of servers in the data-center. However, to maintain the 99th percentile targets servers are routinely run well below full utilization. The main difficulty in optimizing a server to run closer to peak utilization and maintain predictable 99th percentile response latencies is identifying and mitigating the causes of a request missing the target service time. In practice this analysis is challenging as requests and responses overlap their execution with respect to one another and traverse multiple layers of software, user/kernel protection boundaries, and the hardware/software divide. Traditional profiling methods that record the time being spent in each function usually yield few clues as to where a bottleneck may be present due to the many layers of software each consuming only a small fraction of time each. In this work we analyze the end-to-end sources of latency in a Memcached server from the wire through the kernel into the application and back again. To do so, we develop a tool that utilizes the Linux SystemTap infrastructure to measure latency throughout the many software layers that make up the complete request and response path for Memcached. While memory copies and the Linux networking stack are often suggested as major contributors to latency, we find that the main cause of missing response latency guarantees is the formation of standing queues and the application's inability to detect and remedy this situation.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127353194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-03-29DOI: 10.1109/ISPASS.2015.7095780
Michael Papamichael, Cagla Cakir, Chen Sun, C. Chen, J. Hoe, K. Mai, L. Peh, V. Stojanović
Computer architects are increasingly interested in evaluating their ideas at the register-transfer level (RTL) to gain more precise insights on the key characteristics (frequency, area, power) of a micro/architectural design proposal. However, the RTL synthesis process is notoriously tedious, slow, and errorprone and is often outside the area of expertise of a typical computer architect, as it requires familiarity with complex CAD flows, hard-to-get tools and standard cell libraries. The effort is further multiplied when targeting multiple technology nodes and standard cell variants to study technology dependence. This paper presents DELPHI, a flexible, open framework that leverages the DSENT modeling engine for faster, easier, and more efficient characterization of RTL hardware designs. DELPHI first synthesizes a Verilog or VHDL RTL design (either using the industry-standard Synopsys Design Compiler tool or a combination of open-source tools) to an intermediate structural netlist. It then processes the resulting synthesized netlist to generate a technology-independent DSENT design model. This model can then be used within a modified version of the DSENT flow to perform very fast-one to two orders of magnitude faster than full RTL synthesis-estimation of hardware performance characteristics, such as frequency, area, and power across a variety of DSENT technology models (e.g., 65nm Bulk, 32nm SOI, 11nm Tri-Gate, etc.). In our evaluation using 26 RTL design examples, DELPHI and DSENT were consistently able to closely track and capture design trends of conventional RTL synthesis results without the associated delay and complexity. We are releasing the full DELPHI framework (including a fully open-source flow) at http://www.ece.cmu.edu/CALCM/delphi/.
{"title":"DELPHI: a framework for RTL-based architecture design evaluation using DSENT models","authors":"Michael Papamichael, Cagla Cakir, Chen Sun, C. Chen, J. Hoe, K. Mai, L. Peh, V. Stojanović","doi":"10.1109/ISPASS.2015.7095780","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095780","url":null,"abstract":"Computer architects are increasingly interested in evaluating their ideas at the register-transfer level (RTL) to gain more precise insights on the key characteristics (frequency, area, power) of a micro/architectural design proposal. However, the RTL synthesis process is notoriously tedious, slow, and errorprone and is often outside the area of expertise of a typical computer architect, as it requires familiarity with complex CAD flows, hard-to-get tools and standard cell libraries. The effort is further multiplied when targeting multiple technology nodes and standard cell variants to study technology dependence. This paper presents DELPHI, a flexible, open framework that leverages the DSENT modeling engine for faster, easier, and more efficient characterization of RTL hardware designs. DELPHI first synthesizes a Verilog or VHDL RTL design (either using the industry-standard Synopsys Design Compiler tool or a combination of open-source tools) to an intermediate structural netlist. It then processes the resulting synthesized netlist to generate a technology-independent DSENT design model. This model can then be used within a modified version of the DSENT flow to perform very fast-one to two orders of magnitude faster than full RTL synthesis-estimation of hardware performance characteristics, such as frequency, area, and power across a variety of DSENT technology models (e.g., 65nm Bulk, 32nm SOI, 11nm Tri-Gate, etc.). In our evaluation using 26 RTL design examples, DELPHI and DSENT were consistently able to closely track and capture design trends of conventional RTL synthesis results without the associated delay and complexity. We are releasing the full DELPHI framework (including a fully open-source flow) at http://www.ece.cmu.edu/CALCM/delphi/.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128279220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-03-29DOI: 10.1109/ISPASS.2015.7095807
Matthew Halpern, Yuhao Zhu, R. Peri, V. Reddi
In contrast to traditional computing systems, such as desktops and servers, that are programmed to perform “compute-bound” and “run-to-completion” tasks, mobile applications are designed for user interactivity. Factoring user interactivity into computer system design and evaluation is important, yet possesses many challenges. In particular, systematically studying interactive mobile applications across the diverse set of mobile devices available today is difficult due to the mobile device fragmentation problem. At the time of writing, there are 18,796 distinct Android mobile devices on the market and will only continue to increase in the future. Differences in screen sizes, resolutions and operating systems impose different interactivity requirements, making it difficult to uniformly study these systems. We present Mosaic, a cross-platform, timing-accurate record and replay tool for Android-based mobile devices. Mosaic overcomes device fragmentation through a novel virtual screen abstraction. User interactions are translated from a physical device into a platform-agnostic intermediate representation before translation to a target system. The intermediate representation is human-readable, which allows Mosaic users to modify previously recorded traces or even synthesize their own user interactive sessions from scratch. We demonstrate that Mosaic allows user interaction traces to be recorded on emulators, smartphones, tablets, and development boards and replayed on other devices. Using Mosaic we were able to replay 45 different Google Play applications across multiple devices, and also show that we can perform cross-platform performance comparisons between two different processors under identical user interactions.
{"title":"Mosaic: cross-platform user-interaction record and replay for the fragmented android ecosystem","authors":"Matthew Halpern, Yuhao Zhu, R. Peri, V. Reddi","doi":"10.1109/ISPASS.2015.7095807","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095807","url":null,"abstract":"In contrast to traditional computing systems, such as desktops and servers, that are programmed to perform “compute-bound” and “run-to-completion” tasks, mobile applications are designed for user interactivity. Factoring user interactivity into computer system design and evaluation is important, yet possesses many challenges. In particular, systematically studying interactive mobile applications across the diverse set of mobile devices available today is difficult due to the mobile device fragmentation problem. At the time of writing, there are 18,796 distinct Android mobile devices on the market and will only continue to increase in the future. Differences in screen sizes, resolutions and operating systems impose different interactivity requirements, making it difficult to uniformly study these systems. We present Mosaic, a cross-platform, timing-accurate record and replay tool for Android-based mobile devices. Mosaic overcomes device fragmentation through a novel virtual screen abstraction. User interactions are translated from a physical device into a platform-agnostic intermediate representation before translation to a target system. The intermediate representation is human-readable, which allows Mosaic users to modify previously recorded traces or even synthesize their own user interactive sessions from scratch. We demonstrate that Mosaic allows user interaction traces to be recorded on emulators, smartphones, tablets, and development boards and replayed on other devices. Using Mosaic we were able to replay 45 different Google Play applications across multiple devices, and also show that we can perform cross-platform performance comparisons between two different processors under identical user interactions.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130436273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-03-29DOI: 10.1109/ISPASS.2015.7095800
Junjie Qian, Du Li, W. Srisa-an, Hong Jiang, S. Seth
Modern Java applications employ multithreading to improve performance by harnessing execution parallelism available in today's multicore processors. However, as the numbers of threads and processing cores are scaled up, many applications do not achieve the desired level of performance improvement. In this paper, we explore two factors, lock contention and garbage collection performance that can affect scalability of Java applications. Our initial result reveals two new observations. First, applications that are highly scalable may experience more instances of lock contention than those experienced by applications that are less scalable. Second, efficient multithreading can make garbage collection less effective, and therefore, negatively impacting garbage collection performance.
{"title":"Factors affecting scalability of multithreaded Java applications on manycore systems","authors":"Junjie Qian, Du Li, W. Srisa-an, Hong Jiang, S. Seth","doi":"10.1109/ISPASS.2015.7095800","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095800","url":null,"abstract":"Modern Java applications employ multithreading to improve performance by harnessing execution parallelism available in today's multicore processors. However, as the numbers of threads and processing cores are scaled up, many applications do not achieve the desired level of performance improvement. In this paper, we explore two factors, lock contention and garbage collection performance that can affect scalability of Java applications. Our initial result reveals two new observations. First, applications that are highly scalable may experience more instances of lock contention than those experienced by applications that are less scalable. Second, efficient multithreading can make garbage collection less effective, and therefore, negatively impacting garbage collection performance.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123372261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-03-29DOI: 10.1109/ISPASS.2015.7095785
Xiaoyue Pan, B. Jonsson
We develop an analytical modeling framework for efficient prediction of cache miss ratios based on reuse distance distributions. The only input needed for our predictions is the reuse distance distribution of a program execution: previous work has shown that they can be obtained with very small overhead by sampling from native executions. This should be contrasted with previous approaches that base predictions on stack distance distributions, whose collection need significantly larger overhead or additional hardware support. The predictions are based on a uniform modeling framework which can be specialized for a variety of cache replacement policies, including Random, LRU, PLRU, and MRU (aka. bit-PLRU), and for arbitrary values of cache size and cache associativity. We evaluate our modeling framework with the SPEC CPU 2006 benchmark suite over a set of cache configurations with varying cache size, associativity and replacement policy. The introduced inaccuracies were generally below 1% for the model of the policy, and additionally around 2% when set-local reuse distances must be estimated from global reuse distance distributions. The inaccuracy introduced by sampling is significantly smaller.
我们开发了一个基于重用距离分布的有效预测缓存缺失率的分析建模框架。我们预测所需的唯一输入是程序执行的重用距离分布:以前的工作表明,通过从本机执行中抽样,可以以非常小的开销获得它们。这应该与以前基于堆栈距离分布的预测方法形成对比,后者的收集需要更大的开销或额外的硬件支持。预测基于统一的建模框架,该框架可以专门用于各种缓存替换策略,包括Random, LRU, PLRU和MRU。bit-PLRU),以及缓存大小和缓存关联性的任意值。我们使用SPEC CPU 2006基准测试套件对一组缓存配置进行评估,这些配置具有不同的缓存大小、关联性和替换策略。对于策略模型,引入的不准确性通常低于1%,当必须从全局重用距离分布估计集局部重用距离时,引入的不准确性约为2%。由抽样引入的不准确性明显更小。
{"title":"A modeling framework for reuse distance-based estimation of cache performance","authors":"Xiaoyue Pan, B. Jonsson","doi":"10.1109/ISPASS.2015.7095785","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095785","url":null,"abstract":"We develop an analytical modeling framework for efficient prediction of cache miss ratios based on reuse distance distributions. The only input needed for our predictions is the reuse distance distribution of a program execution: previous work has shown that they can be obtained with very small overhead by sampling from native executions. This should be contrasted with previous approaches that base predictions on stack distance distributions, whose collection need significantly larger overhead or additional hardware support. The predictions are based on a uniform modeling framework which can be specialized for a variety of cache replacement policies, including Random, LRU, PLRU, and MRU (aka. bit-PLRU), and for arbitrary values of cache size and cache associativity. We evaluate our modeling framework with the SPEC CPU 2006 benchmark suite over a set of cache configurations with varying cache size, associativity and replacement policy. The introduced inaccuracies were generally below 1% for the model of the policy, and additionally around 2% when set-local reuse distances must be estimated from global reuse distance distributions. The inaccuracy introduced by sampling is significantly smaller.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"454 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134497351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-03-29DOI: 10.1109/ISPASS.2015.7095787
Bin Li, Shaoming Chen, Lu Peng
Performance variability, stemming from non-deterministic hardware and software behaviors or deterministic behaviors such as measurement bias, is a well-known phenomenon of computer systems which increases the difficulty of comparing computer performance metrics. Conventional methods use various measures (such as geometric mean) to quantify the performance of different benchmarks to compare computers without considering variability. This may lead to wrong conclusions. In this paper, we propose three resampling methods for performance evaluation and comparison: a randomization test for a general performance comparison between two computers, bootstrapping confidence estimation, and an empirical distribution and five-number-summary for performance evaluation. The results show that 1) the randomization test substantially improves our chance to identify the difference between performance comparisons when the difference is not large; 2) bootstrapping confidence estimation provides an accurate confidence interval for the performance comparison measure (e.g. ratio of geometric means); and 3) when the difference is very small, a single test is often not enough to reveal the nature of the computer performance and a five-number-summary to summarize computer performance. We illustrate the results and conclusion through detailed Monte Carlo simulation studies and real examples. Results show that our methods are precise and robust even when two computers have very similar performance metrics.
{"title":"Precise computer comparisons via statistical resampling methods","authors":"Bin Li, Shaoming Chen, Lu Peng","doi":"10.1109/ISPASS.2015.7095787","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095787","url":null,"abstract":"Performance variability, stemming from non-deterministic hardware and software behaviors or deterministic behaviors such as measurement bias, is a well-known phenomenon of computer systems which increases the difficulty of comparing computer performance metrics. Conventional methods use various measures (such as geometric mean) to quantify the performance of different benchmarks to compare computers without considering variability. This may lead to wrong conclusions. In this paper, we propose three resampling methods for performance evaluation and comparison: a randomization test for a general performance comparison between two computers, bootstrapping confidence estimation, and an empirical distribution and five-number-summary for performance evaluation. The results show that 1) the randomization test substantially improves our chance to identify the difference between performance comparisons when the difference is not large; 2) bootstrapping confidence estimation provides an accurate confidence interval for the performance comparison measure (e.g. ratio of geometric means); and 3) when the difference is very small, a single test is often not enough to reveal the nature of the computer performance and a five-number-summary to summarize computer performance. We illustrate the results and conclusion through detailed Monte Carlo simulation studies and real examples. Results show that our methods are precise and robust even when two computers have very similar performance metrics.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116818624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-03-29DOI: 10.1109/ISPASS.2015.7095811
Derek Lockhart, Berkin Ilbeyi, C. Batten
Instruction set simulators (ISSs) remain an essential tool for the rapid exploration and evaluation of instruction set extensions in both academia and industry. Due to their importance in both hardware and software design, modern ISSs must balance a tension between developer productivity and high-performance simulation. Productivity requirements have led to “ADL-driven” toolflows that automatically generate ISSs from high-level architectural description languages (ADLs). Meanwhile, performance requirements have prompted ISSs to incorporate increasingly complicated dynamic binary translation (DBT) techniques. Construction of frameworks capable of providing both the productivity benefits of ADL-generated simulators and the performance benefits of DBT remains a significant challenge. We introduce Pydgin, a new approach to ISS construction that addresses the multiple challenges of designing, implementing, and maintaining ADL-generated DBT-ISSs. Pydgin uses a Python-based, embedded-ADL to succinctly describe instruction behavior as directly executable “pseudocode”. These Pydgin ADL descriptions are used to automatically generate high-performance DBT-ISSs by creatively adapting an existing meta-tracing JIT compilation framework designed for general-purpose dynamic programming languages. We demonstrate the capabilities of Pydgin by implementing ISSs for two instruction sets and show that Pydgin provides concise, flexible ISA descriptions while also generating simulators with performance comparable to hand-coded DBT-ISSs.
{"title":"Pydgin: generating fast instruction set simulators from simple architecture descriptions with meta-tracing JIT compilers","authors":"Derek Lockhart, Berkin Ilbeyi, C. Batten","doi":"10.1109/ISPASS.2015.7095811","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095811","url":null,"abstract":"Instruction set simulators (ISSs) remain an essential tool for the rapid exploration and evaluation of instruction set extensions in both academia and industry. Due to their importance in both hardware and software design, modern ISSs must balance a tension between developer productivity and high-performance simulation. Productivity requirements have led to “ADL-driven” toolflows that automatically generate ISSs from high-level architectural description languages (ADLs). Meanwhile, performance requirements have prompted ISSs to incorporate increasingly complicated dynamic binary translation (DBT) techniques. Construction of frameworks capable of providing both the productivity benefits of ADL-generated simulators and the performance benefits of DBT remains a significant challenge. We introduce Pydgin, a new approach to ISS construction that addresses the multiple challenges of designing, implementing, and maintaining ADL-generated DBT-ISSs. Pydgin uses a Python-based, embedded-ADL to succinctly describe instruction behavior as directly executable “pseudocode”. These Pydgin ADL descriptions are used to automatically generate high-performance DBT-ISSs by creatively adapting an existing meta-tracing JIT compilation framework designed for general-purpose dynamic programming languages. We demonstrate the capabilities of Pydgin by implementing ISSs for two instruction sets and show that Pydgin provides concise, flexible ISA descriptions while also generating simulators with performance comparable to hand-coded DBT-ISSs.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134258300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-03-29DOI: 10.1109/ISPASS.2015.7095793
Amro Awad, B. Kettering, Yan Solihin
Emerging non-volatile memories (NVMs), such as Phase-Change Memory (PCM), Spin-Transfer Torque RAM (STT-RAM) and Memristor, are very promising candidates for replacing NAND-Flash Solid-State Drives (SSDs) and Hard Disk Drives (HDDs) for many reasons. First, their read/write latencies are orders of magnitude faster. Second, some emerging NVMs, such as memristors, are expected to have very high densities, which allow deploying a much higher capacity without requiring increased physical space. While the percentage of the time taken for data movement over low-speed buses, such as Peripheral Component Interconnect (PCI), is negligible for the overall read/write latency in HDDs, it could be dominant for emerging fast NVMs. Therefore, the trend has moved toward using very fast interconnect technologies, such as PCI Express (PCIe) which is hundreds of times faster than the traditional PCI. Accordingly, new host controller interfaces are used to communicate with I/O devices to exploit the parallelism and low-latency features of emerging NVMs through high-speed interconnects. In this paper, we investigate the system performance bottlenecks and overhead of using the standard state-of-the-art Non-Volatile Memory Express (NVMe), or Non-Volatile Memory Host Controller Interface (NVMHCI) Specification [1] as representative for NVM host controller interfaces.
{"title":"Non-volatile memory host controller interface performance analysis in high-performance I/O systems","authors":"Amro Awad, B. Kettering, Yan Solihin","doi":"10.1109/ISPASS.2015.7095793","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095793","url":null,"abstract":"Emerging non-volatile memories (NVMs), such as Phase-Change Memory (PCM), Spin-Transfer Torque RAM (STT-RAM) and Memristor, are very promising candidates for replacing NAND-Flash Solid-State Drives (SSDs) and Hard Disk Drives (HDDs) for many reasons. First, their read/write latencies are orders of magnitude faster. Second, some emerging NVMs, such as memristors, are expected to have very high densities, which allow deploying a much higher capacity without requiring increased physical space. While the percentage of the time taken for data movement over low-speed buses, such as Peripheral Component Interconnect (PCI), is negligible for the overall read/write latency in HDDs, it could be dominant for emerging fast NVMs. Therefore, the trend has moved toward using very fast interconnect technologies, such as PCI Express (PCIe) which is hundreds of times faster than the traditional PCI. Accordingly, new host controller interfaces are used to communicate with I/O devices to exploit the parallelism and low-latency features of emerging NVMs through high-speed interconnects. In this paper, we investigate the system performance bottlenecks and overhead of using the standard state-of-the-art Non-Volatile Memory Express (NVMe), or Non-Volatile Memory Host Controller Interface (NVMHCI) Specification [1] as representative for NVM host controller interfaces.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131152531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-03-29DOI: 10.1109/ISPASS.2015.7095802
Wes Felter, Alexandre Ferreira, R. Rajamony, J. Rubio
Cloud computing makes extensive use of virtual machines because they permit workloads to be isolated from one another and for the resource usage to be somewhat controlled. In this paper, we explore the performance of traditional virtual machine (VM) deployments, and contrast them with the use of Linux containers. We use KVM as a representative hypervisor and Docker as a container manager. Our results show that containers result in equal or better performance than VMs in almost all cases. Both VMs and containers require tuning to support I/Ointensive applications. We also discuss the implications of our performance results for future cloud architectures.
{"title":"An updated performance comparison of virtual machines and Linux containers","authors":"Wes Felter, Alexandre Ferreira, R. Rajamony, J. Rubio","doi":"10.1109/ISPASS.2015.7095802","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095802","url":null,"abstract":"Cloud computing makes extensive use of virtual machines because they permit workloads to be isolated from one another and for the resource usage to be somewhat controlled. In this paper, we explore the performance of traditional virtual machine (VM) deployments, and contrast them with the use of Linux containers. We use KVM as a representative hypervisor and Docker as a container manager. Our results show that containers result in equal or better performance than VMs in almost all cases. Both VMs and containers require tuning to support I/Ointensive applications. We also discuss the implications of our performance results for future cloud architectures.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126865420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}