全速前进:接近本地速度的详细架构模拟

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI:10.1109/IISWC.2015.29

Andreas Sandberg, Nikos Nikoleris, Trevor E. Carlson, Erik Hagersten, S. Kaxiras, D. Black-Schaffer

{"title":"全速前进:接近本地速度的详细架构模拟","authors":"Andreas Sandberg, Nikos Nikoleris, Trevor E. Carlson, Erik Hagersten, S. Kaxiras, D. Black-Schaffer","doi":"10.1109/IISWC.2015.29","DOIUrl":null,"url":null,"abstract":"Cycle-level micro architectural simulation is the de-facto standard to estimate performance of next-generation platforms. Unfortunately, the level of detail needed for accurate simulation requires complex, and therefore slow, simulation models that run at speeds that are thousands of times slower than native execution. With the introduction of sampled simulation, it has become possible to simulate only the key, representative portions of a workload in a reasonable amount of time and reliably estimate its overall performance. These sampling methodologies provide the ability to identify regions for detailed execution, and through micro architectural state check pointing, one can quickly and easily determine the performance characteristics of a workload for a variety of micro architectural changes. While this strategy of sampling simulations to generate checkpoints performs well for static applications, more complex scenarios involving hardware-software co-design (such as co-optimizing both a Java virtual machine and the micro architecture it is running on) cause this methodology to break down, as new micro architectural checkpoints are needed for each memory hierarchy configuration and software version. Solutions are therefore needed to enable fast and accurate simulation that also address the needs of hardware-software co-design and exploration. In this work we present a methodology to enhance checkpoint-based sampled simulation. Our solution integrates hardware virtualization to provide near-native speed, virtualized fast-forwarding to regions of interest, and parallel detailed simulation. However, as we cannot warm the simulated caches during virtualized fast-forwarding, we develop a novel approach to estimate the error introduced by limited cache warming, through the use of optimistic and pessimistic warming simulations. Using virtualized fast-forwarding (which operates at 90% of native speed on average), we demonstrate a parallel sampling simulator that can be used to accurately estimate the IPC of standard workloads with an average error of 2.2% while still reaching an execution rate of 2.0 GIPS (63% of native) on average. Additionally, we demonstrate that our parallelization strategy scales almost linearly and simulates one core at up to 93% of its native execution rate, 19,000x faster than detailed simulation, while using 8 cores.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"44","resultStr":"{\"title\":\"Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed\",\"authors\":\"Andreas Sandberg, Nikos Nikoleris, Trevor E. Carlson, Erik Hagersten, S. Kaxiras, D. Black-Schaffer\",\"doi\":\"10.1109/IISWC.2015.29\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cycle-level micro architectural simulation is the de-facto standard to estimate performance of next-generation platforms. Unfortunately, the level of detail needed for accurate simulation requires complex, and therefore slow, simulation models that run at speeds that are thousands of times slower than native execution. With the introduction of sampled simulation, it has become possible to simulate only the key, representative portions of a workload in a reasonable amount of time and reliably estimate its overall performance. These sampling methodologies provide the ability to identify regions for detailed execution, and through micro architectural state check pointing, one can quickly and easily determine the performance characteristics of a workload for a variety of micro architectural changes. While this strategy of sampling simulations to generate checkpoints performs well for static applications, more complex scenarios involving hardware-software co-design (such as co-optimizing both a Java virtual machine and the micro architecture it is running on) cause this methodology to break down, as new micro architectural checkpoints are needed for each memory hierarchy configuration and software version. Solutions are therefore needed to enable fast and accurate simulation that also address the needs of hardware-software co-design and exploration. In this work we present a methodology to enhance checkpoint-based sampled simulation. Our solution integrates hardware virtualization to provide near-native speed, virtualized fast-forwarding to regions of interest, and parallel detailed simulation. However, as we cannot warm the simulated caches during virtualized fast-forwarding, we develop a novel approach to estimate the error introduced by limited cache warming, through the use of optimistic and pessimistic warming simulations. Using virtualized fast-forwarding (which operates at 90% of native speed on average), we demonstrate a parallel sampling simulator that can be used to accurately estimate the IPC of standard workloads with an average error of 2.2% while still reaching an execution rate of 2.0 GIPS (63% of native) on average. Additionally, we demonstrate that our parallelization strategy scales almost linearly and simulates one core at up to 93% of its native execution rate, 19,000x faster than detailed simulation, while using 8 cores.\",\"PeriodicalId\":142698,\"journal\":{\"name\":\"2015 IEEE International Symposium on Workload Characterization\",\"volume\":\"42 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-10-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"44\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE International Symposium on Workload Characterization\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IISWC.2015.29\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Symposium on Workload Characterization","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IISWC.2015.29","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 44

摘要

周期级微架构仿真是评估下一代平台性能的事实上的标准。不幸的是，精确模拟所需的细节级别需要复杂的，因此缓慢的仿真模型，其运行速度比本机执行速度慢数千倍。随着采样模拟的引入，可以在合理的时间内只模拟工作负载的关键、代表性部分，并可靠地估计其整体性能。这些抽样方法提供了识别区域以进行详细执行的能力，并且通过微体系结构状态检查点，可以快速轻松地确定各种微体系结构更改的工作负载的性能特征。虽然这种抽样模拟生成检查点的策略对于静态应用程序执行良好，但涉及软硬件协同设计的更复杂场景(例如共同优化Java虚拟机及其运行的微体系结构)会导致这种方法失效，因为每个内存层次结构配置和软件版本都需要新的微体系结构检查点。因此，需要解决方案来实现快速准确的仿真，同时满足硬件软件协同设计和探索的需求。在这项工作中，我们提出了一种方法来增强基于检查点的采样模拟。我们的解决方案集成了硬件虚拟化，以提供接近本地的速度，虚拟化的快速转发到感兴趣的区域，以及并行的详细模拟。然而，由于我们不能在虚拟快进过程中加热模拟缓存，我们开发了一种新的方法，通过使用乐观和悲观的变暖模拟来估计有限缓存变暖带来的误差。使用虚拟化快速转发(其平均运行速度为本机速度的90%)，我们演示了一个并行采样模拟器，该模拟器可用于准确估计标准工作负载的IPC，平均误差为2.2%，同时平均执行速度仍达到2.0 GIPS(本机速度的63%)。此外，我们证明了我们的并行化策略几乎是线性扩展的，并且在使用8个核心时，以高达93%的原生执行速率模拟一个核心，比详细模拟快19,000倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed

Cycle-level micro architectural simulation is the de-facto standard to estimate performance of next-generation platforms. Unfortunately, the level of detail needed for accurate simulation requires complex, and therefore slow, simulation models that run at speeds that are thousands of times slower than native execution. With the introduction of sampled simulation, it has become possible to simulate only the key, representative portions of a workload in a reasonable amount of time and reliably estimate its overall performance. These sampling methodologies provide the ability to identify regions for detailed execution, and through micro architectural state check pointing, one can quickly and easily determine the performance characteristics of a workload for a variety of micro architectural changes. While this strategy of sampling simulations to generate checkpoints performs well for static applications, more complex scenarios involving hardware-software co-design (such as co-optimizing both a Java virtual machine and the micro architecture it is running on) cause this methodology to break down, as new micro architectural checkpoints are needed for each memory hierarchy configuration and software version. Solutions are therefore needed to enable fast and accurate simulation that also address the needs of hardware-software co-design and exploration. In this work we present a methodology to enhance checkpoint-based sampled simulation. Our solution integrates hardware virtualization to provide near-native speed, virtualized fast-forwarding to regions of interest, and parallel detailed simulation. However, as we cannot warm the simulated caches during virtualized fast-forwarding, we develop a novel approach to estimate the error introduced by limited cache warming, through the use of optimistic and pessimistic warming simulations. Using virtualized fast-forwarding (which operates at 90% of native speed on average), we demonstrate a parallel sampling simulator that can be used to accurately estimate the IPC of standard workloads with an average error of 2.2% while still reaching an execution rate of 2.0 GIPS (63% of native) on average. Additionally, we demonstrate that our parallelization strategy scales almost linearly and simulates one core at up to 93% of its native execution rate, 19,000x faster than detailed simulation, while using 8 cores.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2015 IEEE International Symposium on Workload Characterization

自引率

0.00%

发文量

期刊最新文献

Fast Computational GPU Design with GT-Pin On Power-Performance Characterization of Concurrent Throughput Kernels CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores Exploring Parallel Programming Models for Heterogeneous Computing Systems Revealing Critical Loads and Hidden Data Locality in GPGPU Applications