Implementing a GPU programming model on a Non-GPU accelerator architecture
Stephen M. Kofsky, Daniel R. Johnson, John A. Stratton, Wen-mei W. Hwu, Sanjay J. Patel, S. Lumetta
Pub Date: 2010-06-19 | DOI: 10.1007/978-3-642-24322-6_5 | Pages: 40-51
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
Sunpyo Hong, Hyesoon Kim
Pub Date: 2009-06-20 | DOI: 10.1145/1555754.1555775 | Pages: 152-163
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.
Scalable high performance main memory system using phase-change memory technology
Moinuddin K. Qureshi, V. Srinivasan, J. Rivers
Pub Date: 2009-06-15 | DOI: 10.1145/1555754.1555760 | Pages: 24-33
The memory subsystem accounts for a significant cost and power budget of a computer system. Current DRAM-based main memory systems are starting to hit the power and cost limit. An alternative memory technology that uses resistance contrast in phase-change materials is being actively investigated in the circuits community. Phase Change Memory (PCM) devices offer more density relative to DRAM, and can help increase main memory capacity of future systems while remaining within the cost and power constraints. In this paper, we analyze a PCM-based hybrid main memory system using an architecture-level model of PCM. We explore the trade-offs for a main memory system consisting of PCM storage coupled with a small DRAM buffer. Such an architecture has the latency benefits of DRAM and the capacity benefits of PCM. Our evaluations for a baseline system of 16 cores with 8GB DRAM show that, on average, PCM can reduce page faults by 5X and provide a speedup of 3X. As PCM is projected to have limited write endurance, we also propose simple organizational and management solutions for the hybrid memory that reduce the write traffic to PCM, boosting its lifetime from 3 years to 9.7 years.
Multi-execution: multicore caching for data-similar executions
Susmit Biswas, D. Franklin, Alan Savage, Ryan Dixon, T. Sherwood, F. Chong
Pub Date: 2009-06-15 | DOI: 10.1145/1555754.1555777 | Pages: 164-173
While microprocessor designers turn to multicore architectures to sustain performance expectations, the dramatic increase in parallelism of such architectures will put substantial demands on off-chip bandwidth and make the memory wall more significant than ever. This paper demonstrates that one profitable application of multicore processors is the execution of many similar instantiations of the same program. We identify that this model of execution is used in several practical scenarios and term it "multi-execution." Often, each such instance utilizes very similar data. In conventional cache hierarchies, each instance would cache its own data independently. We propose the Mergeable cache architecture that detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements. This leads to reductions in off-chip memory accesses and overall power usage, and increases in application performance. We present cycle-accurate simulation results of 8 benchmarks (6 from SPEC2000) to demonstrate that our technique provides a scalable solution and leads to significant speedups due to reductions in main memory accesses. For 8 cores running 8 similar executions of the same application and sharing an exclusive 4-MB, 8-way L2 cache, the Mergeable cache shows an average speedup of 2.5x (ranging from 0.93x to 6.92x), while posing an overhead of only 4.28% on cache area and 5.21% on power when it is used.
{"title":"Multi-execution: multicore caching for data-similar executions","authors":"Susmit Biswas, D. Franklin, Alan Savage, Ryan Dixon, T. Sherwood, F. Chong","doi":"10.1145/1555754.1555777","DOIUrl":"https://doi.org/10.1145/1555754.1555777","url":null,"abstract":"While microprocessor designers turn to multicore architectures to sustain performance expectations, the dramatic increase in parallelism of such architectures will put substantial demands on off-chip bandwidth and make the memory wall more significant than ever. This paper demonstrates that one profitable application of multicore processors is the execution of many similar instantiations of the same program. We identify that this model of execution is used in several practical scenarios and term it as \"multi-execution.\" Often, each such instance utilizes very similar data. In conventional cache hierarchies, each instance would cache its own data independently. We propose the Mergeable cache architecture that detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements. This leads to reductions in off-chip memory accesses and overall power usage, and increases in application performance. We present cycle-accurate simulation results of 8 benchmarks (6 from SPEC2000) to demonstrate that our technique provides a scalable solution and leads to significant speedups due to reductions in main memory accesses. For 8 cores running 8 similar executions of the same application and sharing an exclusive 4-MB, 8-way L2 cache, the Mergeable cache shows a speedup in execution by 2.5x on average (ranging from 0.93x to 6.92x), while posing an overhead of only 4.28% on cache area and 5.21% on power when it is used.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"61 1","pages":"164-173"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75108586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Firefly: illuminating future network-on-chip with nanophotonics
Yan Pan, Prabhat Kumar, John Kim, G. Memik, Yu Zhang, A. Choudhary
Pub Date: 2009-06-15 | DOI: 10.1145/1555754.1555808 | Pages: 429-440
Future many-core processors will require high-performance yet energy-efficient on-chip networks to provide a communication substrate for the increasing number of cores. Recent advances in silicon nanophotonics create new opportunities for on-chip networks. To efficiently exploit the benefits of nanophotonics, we propose Firefly, a hybrid, hierarchical network architecture. Firefly consists of clusters of nodes that are connected using conventional electrical signaling, while inter-cluster communication is done using nanophotonics. This exploits the benefits of electrical signaling for short, local communication while reserving nanophotonics for global communication, realizing an efficient on-chip network. A crossbar architecture is used for inter-cluster communication. However, to avoid global arbitration, the crossbar is partitioned into multiple, logical crossbars and their arbitration is localized. Our evaluations show that Firefly improves performance by up to 57% compared to an all-electrical concentrated mesh (CMESH) topology on adversarial traffic patterns and up to 54% compared to an all-optical crossbar (OP XBAR) on traffic patterns with locality. When the energy-delay product is compared, Firefly improves the efficiency of the on-chip network by up to 51% and 38% compared to CMESH and OP XBAR, respectively.
{"title":"Firefly: illuminating future network-on-chip with nanophotonics","authors":"Yan Pan, Prabhat Kumar, John Kim, G. Memik, Yu Zhang, A. Choudhary","doi":"10.1145/1555754.1555808","DOIUrl":"https://doi.org/10.1145/1555754.1555808","url":null,"abstract":"Future many-core processors will require high-performance yet energy-efficient on-chip networks to provide a communication substrate for the increasing number of cores. Recent advances in silicon nanophotonics create new opportunities for on-chip networks. To efficiently exploit the benefits of nanophotonics, we propose Firefly - a hybrid, hierarchical network architecture. Firefly consists of clusters of nodes that are connected using conventional, electrical signaling while the inter-cluster communication is done using nanophotonics - exploiting the benefits of electrical signaling for short, local communication while nanophotonics is used only for global communication to realize an efficient on-chip network. Crossbar architecture is used for inter-cluster communication. However, to avoid global arbitration, the crossbar is partitioned into multiple, logical crossbars and their arbitration is localized. Our evaluations show that Firefly improves the performance by up to 57% compared to an all-electrical concentrated mesh (CMESH) topology on adversarial traffic patterns and up to 54% compared to an all-optical crossbar (OP XBAR) on traffic patterns with locality. If the energy-delay-product is compared, Firefly improves the efficiency of the on-chip network by up to 51% and 38% compared to CMESH and OP XBAR, respectively.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"12 1","pages":"429-440"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73953550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors
A. Bhattacharjee, M. Martonosi
Pub Date: 2009-06-15 | DOI: 10.1145/1555754.1555792 | Pages: 290-301
With the shift towards chip multiprocessors (CMPs), exploiting and managing parallelism has become a central problem in computing systems. Many issues of parallelism management boil down to discerning which running threads or processes are critical, or slowest, versus which are non-critical. If one can accurately predict critical threads in a parallel program, then one can respond in a variety of ways. Possibilities include running the critical thread at a faster clock rate, performing load balancing techniques to offload work onto currently non-critical threads, or giving the critical thread more on-chip resources to execute faster. This paper proposes and evaluates simple but effective thread criticality predictors for parallel applications. We show that accurate predictors can be built using counters that are typically already available on-chip. Our predictor, based on memory hierarchy statistics, identifies thread criticality with an average accuracy of 93% across a range of architectures. We also demonstrate two applications of our predictor. First, we show how Intel's Threading Building Blocks (TBB) parallel runtime system can benefit from task stealing techniques that use our criticality predictor to reduce load imbalance. Using criticality prediction to guide TBB's task-stealing decisions improves performance by 13-32% for TBB-based PARSEC benchmarks running on a 32-core CMP. As a second application, criticality prediction guides dynamic energy optimizations in barrier-based applications. By running the predicted critical thread at the full clock rate and frequency-scaling non-critical threads, this approach achieves average energy savings of 15% while negligibly degrading performance for SPLASH-2 and PARSEC benchmarks.
Dynamic MIPS rate stabilization in out-of-order processors
Jinho Suh, M. Dubois
Pub Date: 2009-06-15 | DOI: 10.1145/1555754.1555763 | Pages: 46-56
Today's microprocessor cores reach high performance levels not only by their high clock rate but also by the concurrent execution of a large number of instructions. Because of the relationship between power and frequency, it becomes attractive to run an OoO (Out-of-Order) core at a frequency lower than its nominal frequency in the context of embedded or real-time systems. Unfortunately, whereas OoO pipelines have high average throughput, their highly variable and hard-to-predict execution rate makes them unsuitable for real-time systems with hard or even soft deadlines. In this paper, we demonstrate that the execution time of an OoO processor can be stable and predictable by controlling its MIPS (Mega Instructions Per Second) rate via a PID (Proportional, Integral, and Differential gain) feedback controller and DVFS (Dynamic Voltage and Frequency Scaling). The stabilized processor uses much less power per committed instruction, because of the reduced average frequency. The EPI (Energy Per Instruction) is also cut by an average of 28% across our benchmark programs. Since a stable MIPS rate is maintained consistently with lower power/energy per instruction, OoO processors stabilized by a feedback controller can realistically be deployed in real-time systems. To demonstrate this capability we select a subset of the MiBench benchmarks that displays the widest execution rate variations and stabilize their MIPS rate in the context of a 1GHz Pentium III-like microarchitecture.
A fault tolerant, area efficient architecture for Shor's factoring algorithm
M. Whitney, Nemanja Isailovic, Yatish Patel, J. Kubiatowicz
Pub Date: 2009-06-15 | DOI: 10.1145/1555754.1555802 | Pages: 383-394
We optimize the area and latency of Shor's factoring algorithm while simultaneously improving fault tolerance through: (1) balancing the use of ancilla generators, (2) aggressive optimization of error correction, and (3) tuning the core adder circuits. Our custom CAD flow produces detailed layouts of the physical components and utilizes simulation to analyze circuits in terms of area, latency, and success probability. We introduce a metric, called ADCR, which is the probabilistic equivalent of the classic Area-Delay product. Our error correction optimization can reduce ADCR by an order of magnitude or more. Contrary to conventional wisdom, we show that the area of an optimized quantum circuit is not dominated exclusively by error correction. Further, our adder evaluation shows that quantum carry-lookahead adders (QCLA) beat ripple-carry adders in ADCR, despite being larger and more complex. We conclude with what we believe is one of the most accurate estimates of the area and latency required for 1024-bit Shor's factorization: 7659 mm² for the smallest circuit and 6 × 10⁸ seconds for the fastest circuit.
{"title":"A fault tolerant, area efficient architecture for Shor's factoring algorithm","authors":"M. Whitney, Nemanja Isailovic, Yatish Patel, J. Kubiatowicz","doi":"10.1145/1555754.1555802","DOIUrl":"https://doi.org/10.1145/1555754.1555802","url":null,"abstract":"We optimize the area and latency of Shor's factoring while simultaneously improving fault tolerance through: (1) balancing the use of ancilla generators, (2) aggressive optimization of error correction, and (3) tuning the core adder circuits. Our custom CAD flow produces detailed layouts of the physical components and utilizes simulation to analyze circuits in terms of area, latency, and success probability. We introduce a metric, called ADCR, which is the probabilistic equivalent of the classic Area-Delay product. Our error correction optimization can reduce ADCR by order of magnitude or more. Contrary to conventional wisdom, we show that the area of an optimized quantum circuit is not dominated exclusively by error\u0000 correction. Further, our adder evaluation shows that quantum carry-lookahead adders (QCLA) beat ripple-carry adders in ADCR, despite being larger and more complex. We conclude with what we believe is one of most accurate estimates of the area and latency required for 1024-bit Shor's factorization: 7659 mm2 for the smallest circuit and 6 x 108 seconds for the fastest circuit.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"66 1","pages":"383-394"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89280487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-end performance forecasting: finding bottlenecks before they happen
A. Saidi, N. Binkert, S. Reinhardt, T. Mudge
Pub Date: 2009-06-15 | DOI: 10.1145/1555754.1555800 | Pages: 361-370
Many important workloads today, such as web-hosted services, are limited not by processor core performance but by interactions among the cores, the memory system, I/O devices, and the complex software layers that tie these components together. Architects designing future systems for these workloads are challenged to identify performance bottlenecks because, as in any concurrent system, overheads in one component may be hidden due to overlap with other operations. These overlaps span the user/kernel and software/hardware boundaries, making traditional performance analysis techniques inadequate. We present a methodology for identifying end-to-end critical paths across software and simulated hardware in complex networked systems. By modeling systems as collections of state machines interacting via queues, we can trace critical paths through multiplexed processing engines, identify when resources create bottlenecks (including abstract resources such as flow-control credits), and predict the benefit of eliminating bottlenecks by increasing hardware speeds or expanding available resources. We implement our technique in a full-system simulator and analyze a TCP microbenchmark, a web server, the Linux TCP/IP stack, and an Ethernet controller. From a single run of the microbenchmark, our tool, within minutes, correctly identifies a series of bottlenecks, and predicts the performance of hypothetical systems in which these bottlenecks are successively eliminated, culminating in a total speedup of 3X. We then validate these predictions through hours of additional simulation, and find them to be accurate within 1-17%. We also analyze the web server, find it to be CPU-bound, and predict the performance of a system with an additional core within 6%.
Phastlane: a rapid transit optical routing network
Mark J. Cianchetti, Joseph C. Kerekes, D. Albonesi
Pub Date: 2009-06-15 | DOI: 10.1145/1555754.1555809 | Pages: 441-450
Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 22nm timeframe, on-chip optical interconnect architectures proposed thus far are either limited in scalability or are dependent on comparatively slow electrical control networks. In this paper, we present Phastlane, a hybrid electrical/optical routing network for future large scale, cache coherent multicore microprocessors. The heart of the Phastlane network is a low-latency optical crossbar that uses simple predecoded source routing to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. When contention exists, the router makes use of electrical buffers and, if necessary, a high speed drop signaling network. Overall, Phastlane achieves 2X better network performance than a state-of-the-art electrical baseline while consuming 80% less network power.
{"title":"Phastlane: a rapid transit optical routing network","authors":"Mark J. Cianchetti, Joseph C. Kerekes, D. Albonesi","doi":"10.1145/1555754.1555809","DOIUrl":"https://doi.org/10.1145/1555754.1555809","url":null,"abstract":"Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 22nm timeframe, on-chip optical interconnect architectures proposed thus far are either limited in scalability or are dependent on comparatively slow electrical control networks.\u0000 In this paper, we present Phastlane, a hybrid electrical/optical routing network for future large scale, cache coherent multicore microprocessors. The heart of the Phastlane network is a low-latency optical crossbar that uses simple predecoded source routing to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. When contention exists, the router makes use of electrical buffers and, if necessary, a high speed drop signaling network. Overall, Phastlane achieve 2X better network performance than a state-of-the-art electrical baseline while consuming 80% less network power.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"5 1","pages":"441-450"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76500609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}