
2007 IEEE 13th International Symposium on High Performance Computer Architecture: Latest Publications

Researching Novel Systems: To Instantiate, Emulate, Simulate, or Analyticate?
Pub Date : 2007-08-10 DOI: 10.1109/HPCA.2007.346203
D. Burger, J. Emer, Phil Emma, S. Keckler, Y. Patt, D. Patterson
The computer architecture research community has a rich menu of methodological options, which includes building full system prototypes, measuring in simulation, emulating on FPGAs, or constructing sophisticated analytic models. However, building custom systems has become enormously expensive, especially given the current funding climate. Simulations have become enormously complex as well, often including full operating systems. Analytic models have become less popular as system complexity has grown. Finally, some argue that FPGA emulation of hardware is the right approach for the future, while others opine that it is the worst of all worlds. This panel will debate these various points of view, which are of great interest to the funding sponsors of our community.
Cited by: 0
Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346197
Kiran Puttaswamy, G. Loh
3D integration technology greatly increases transistor density while providing faster on-chip communication. 3D implementations of processors can simultaneously provide both latency and power benefits due to reductions in critical wires. However, 3D stacking of active devices can potentially exacerbate existing thermal problems. In this work, we propose a family of thermal herding techniques that (1) reduce 3D power density and (2) locate a majority of the power on the top die closest to the heat sink. Our 3D/thermal-aware microarchitecture contributions include a significance-partitioned datapath that places the frequently switching 16 bits on the top die, a 3D-aware instruction scheduler allocation scheme, an address memorization approach for the load and store queues, a partial value encoding for the L1 data cache, and a branch target buffer that exploits a form of frequent partial value locality in target addresses. Compared to a conventional planar processor, our 3D processor achieves a 47.9% frequency increase which results in a 47.0% performance improvement (min 7%, max 77% on individual benchmarks), while simultaneously reducing total power by 20% (min 15%, max 30%). Without our thermal herding techniques, the worst-case 3D temperature increases by 17 degrees. With our thermal herding techniques, the temperature increase is only 12 degrees (a 29% reduction in the 3D worst-case temperature increase).
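As a rough illustration of the significance-partitioning idea (a sketch, not the paper's hardware design), the snippet below tests whether a 64-bit two's-complement value is fully described by its low 16 bits plus sign extension, the common case that lets most switching activity stay on the die nearest the heat sink. The function name is hypothetical.

```python
def fits_in_low16(value: int) -> bool:
    """True if a 64-bit two's-complement value is just the sign
    extension of its low 16 bits, so only the low partition of a
    significance-partitioned datapath would switch."""
    low16 = value & 0xFFFF
    # Sign-extend the low 16 bits back to a full-width integer.
    sign_extended = low16 - 0x10000 if (low16 & 0x8000) else low16
    return sign_extended == value

# Small counters and near-zero values never touch the upper 48 bits,
# which is why most switching can be herded onto one die.
for v in (7, -3, 40000, -70000):
    print(v, fits_in_low16(v))   # True, True, False, False
```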
Cited by: 166
A Memory-Level Parallelism Aware Fetch Policy for SMT Processors
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346201
Stijn Eyerman, L. Eeckhout
A thread executing on a simultaneous multithreading (SMT) processor that experiences a long-latency load will eventually stall while holding execution resources. Existing long-latency load aware SMT fetch policies limit the amount of resources allocated by a stalled thread by identifying long-latency loads and preventing the given thread from fetching more instructions - and in some implementations, instructions beyond the long-latency load may even be flushed, which frees allocated resources. This paper proposes an SMT fetch policy that takes into account the available memory-level parallelism (MLP) in a thread. The key idea proposed in this paper is that in the case of an isolated long-latency load, i.e., when there is no MLP, the thread should be prevented from allocating additional resources. However, in case multiple independent long-latency loads overlap, i.e., when there is MLP, the thread should allocate as many resources as needed in order to fully expose the available MLP. The proposed MLP-aware fetch policy achieves better performance for MLP-intensive threads on an SMT processor and achieves a better overall balance between performance and fairness than previously proposed fetch policies.
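A minimal sketch of the decision rule described above, assuming a hypothetical predictor that reports how many independent long-latency loads are expected in the thread's upcoming instruction window (all names and thresholds are illustrative, not the paper's implementation):

```python
def fetch_decision(predicted_mlp: int, resources_held: int,
                   per_load_resources: int) -> str:
    """SMT fetch action for a thread that just issued a long-latency
    load, following the MLP-aware policy's key idea (hypothetical API)."""
    if predicted_mlp <= 1:
        # Isolated long-latency load: stop allocating resources so
        # other threads can use them.
        return "stall-fetch"
    if resources_held < predicted_mlp * per_load_resources:
        # Overlapping misses predicted: keep fetching until enough
        # resources are held to expose all of the available MLP.
        return "continue-fetch"
    return "stall-fetch"

print(fetch_decision(predicted_mlp=1, resources_held=8, per_load_resources=16))   # stall-fetch
print(fetch_decision(predicted_mlp=4, resources_held=8, per_load_resources=16))   # continue-fetch
```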
Cited by: 61
Interactions Between Compression and Prefetching in Chip Multiprocessors
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346200
Alaa R. Alameldeen, D. Wood
In chip multiprocessors (CMPs), multiple cores compete for shared resources such as on-chip caches and off-chip pin bandwidth. Stride-based hardware prefetching increases demand for these resources, causing contention that can degrade performance (up to 35% for one of our benchmarks). In this paper, we first show that cache and link (off-chip interconnect) compression can increase the effective cache capacity (thereby reducing off-chip misses) and increase the effective off-chip bandwidth (reducing contention). On an 8-processor CMP with no prefetching, compression improves performance by up to 18% for commercial workloads. Second, we propose a simple adaptive prefetching mechanism that uses cache compression's extra tags to detect useless and harmful prefetches. Furthermore, in the central result of this paper, we show that compression and prefetching interact in a strong positive way, resulting in combined performance improvements of 10-51% for seven of our eight workloads.
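The adaptive mechanism can be sketched as a simple accuracy-based throttle, assuming the compressed cache's extra tags let the hardware observe whether each prefetched line was later used. The class name, window size, and thresholds below are illustrative assumptions, not the paper's design:

```python
class AdaptivePrefetchThrottle:
    """Tracks prefetch accuracy (observed via a compressed cache's
    extra tags) and adjusts prefetch degree over a sampling window."""

    def __init__(self, low: float = 0.25, high: float = 0.75, window: int = 64):
        self.low, self.high, self.window = low, high, window
        self.useful = 0
        self.useless = 0
        self.degree = 4          # current prefetch aggressiveness

    def record(self, prefetch_was_used: bool) -> None:
        if prefetch_was_used:
            self.useful += 1
        else:
            self.useless += 1
        if self.useful + self.useless >= self.window:
            accuracy = self.useful / (self.useful + self.useless)
            if accuracy < self.low:        # mostly useless or harmful
                self.degree = max(0, self.degree - 1)
            elif accuracy > self.high:     # mostly useful
                self.degree = min(8, self.degree + 1)
            self.useful = self.useless = 0

throttle = AdaptivePrefetchThrottle()
for used in [True] * 10 + [False] * 54:
    throttle.record(used)
print(throttle.degree)   # 3: accuracy 10/64 fell below the low threshold
```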
Cited by: 74
Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346190
B. Ganesh, A. Jaleel, David T. Wang, B. Jacob
Performance gains in memory have traditionally been obtained by increasing memory bus widths and speeds. The diminishing returns of such techniques have led to the proposal of an alternate architecture, the fully-buffered DIMM. This new standard replaces the conventional memory bus with a narrow, high-speed interface between the memory controller and the DIMMs. This paper examines how traditional DDRx-based memory controller policies for scheduling and row buffer management perform on a fully-buffered DIMM memory architecture. The split-bus architecture used by FBDIMM systems results in an average improvement of 7% in latency and 10% in bandwidth at higher utilizations. On the other hand, at lower utilizations, the increased cost of serialization results in a degradation in latency and bandwidth of 25% and 10% respectively. The split-bus architecture also makes the system performance sensitive to the ratio of read and write traffic in the workload. In larger configurations, we found that FBDIMM system performance was more sensitive to usage of the FBDIMM links than to DRAM bank availability. In general, FBDIMM performance is similar to that of DDRx systems, and it provides better performance characteristics at higher utilization, making it a relatively inexpensive mechanism for scaling capacity at higher bandwidth requirements. The mechanism is also largely insensitive to scheduling policies, provided certain ground rules are obeyed.
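A back-of-the-envelope latency model captures the trade-off the abstract describes: the narrow serial FBDIMM link pays a serialization cost that dominates at low utilization, while its split southbound/northbound links avoid the turnaround contention that dominates a shared wide bus at high utilization. All numbers below are illustrative assumptions, not the paper's measurements:

```python
# Illustrative-only parameters (assumptions, not from the paper).
FRAME_NS = 3.3           # time to serialize one FBDIMM frame
DRAM_ACCESS_NS = 45.0    # core DRAM access time, same in both systems
BUS_TURNAROUND_NS = 7.5  # wide-bus read/write turnaround penalty

def ddrx_latency(utilization: float) -> float:
    # Shared wide bus: contention grows with read/write turnarounds.
    return DRAM_ACCESS_NS + utilization * BUS_TURNAROUND_NS * 4

def fbdimm_latency(utilization: float) -> float:
    # Pay serialization both ways, but split links contend less.
    return DRAM_ACCESS_NS + 2 * FRAME_NS + utilization * BUS_TURNAROUND_NS

for u in (0.1, 0.5, 0.9):
    print(f"util={u:.1f}  DDRx={ddrx_latency(u):5.1f} ns  "
          f"FBDIMM={fbdimm_latency(u):5.1f} ns")
# FBDIMM loses at low utilization (serialization overhead) and wins
# at high utilization (less contention), matching the trend above.
```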
Cited by: 88
An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346210
Liqun Cheng, J. Carter, Donglai Dai
Shared memory multiprocessors play an increasingly important role in enterprise and scientific computing facilities. Remote misses limit the performance of shared memory applications, and their significance is growing as network latency increases relative to processor speeds. This paper proposes two mechanisms that improve shared memory performance by eliminating remote misses and/or reducing the amount of communication required to maintain coherence. We focus on improving the performance of applications that exhibit producer-consumer sharing. We first present a simple hardware mechanism for detecting producer-consumer sharing. We then describe a directory delegation mechanism whereby the "home node" of a cache line can be delegated to a producer node, thereby converting 3-hop coherence operations into 2-hop operations. We then extend the delegation mechanism to support speculative updates for data accessed in a producer-consumer pattern, which can convert 2-hop misses into local misses, thereby eliminating the remote memory latency. Both mechanisms can be implemented without changes to the processor. We evaluate our directory delegation and speculative update mechanisms on seven benchmark programs that exhibit producer-consumer sharing using a cycle-accurate execution-driven simulator of a future 16-node SGI multiprocessor. We find that the mechanisms proposed in this paper reduce the average remote miss rate by 40%, reduce network traffic by 15%, and improve performance by 21%. Finally, we use Murphi to verify that each mechanism is error-free and does not violate sequential consistency.
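The detection step can be sketched as a small per-line state machine at the directory: if one node keeps writing a line that other nodes then read, flag producer-consumer sharing. This is a hypothetical sketch, not the paper's hardware:

```python
class PCDetector:
    """Per-cache-line detector for producer-consumer sharing."""
    THRESHOLD = 3   # write->read rounds before declaring the pattern

    def __init__(self):
        self.producer = None
        self.rounds = 0

    def on_write(self, node) -> None:
        if self.producer not in (None, node):
            self.rounds = 0        # a different writer breaks the pattern
        self.producer = node

    def on_read(self, node) -> None:
        if self.producer is not None and node != self.producer:
            self.rounds += 1

    def is_producer_consumer(self) -> bool:
        return self.rounds >= self.THRESHOLD

d = PCDetector()
for _ in range(3):
    d.on_write("P")   # node P produces
    d.on_read("C")    # node C consumes
print(d.is_producer_consumer())   # True
```

Once a line is flagged, the directory can delegate its home to the producer, so a consumer's subsequent miss travels requester-to-producer (two hops) instead of requester-to-home-to-producer (three hops).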
Cited by: 62
Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346182
Hongtao Zhong, Steven A. Lieberman, S. Mahlke
Chip multiprocessors with multiple simpler cores are gaining popularity because they have the potential to drive future performance gains without exacerbating the problems of power dissipation and complexity. Current chip multiprocessors increase throughput by utilizing multiple cores to perform computation in parallel. These designs provide real benefits for server-class applications that are explicitly multi-threaded. However, for desktop and other systems where single-thread applications dominate, multicore systems have yet to offer much benefit. Chip multiprocessors are most efficient at executing coarse-grain threads that have little communication. However, general-purpose applications do not provide many opportunities for identifying such threads, due to frequent use of pointers, recursive data structures, if-then-else branches, small function bodies, and loops with small trip counts. To attack this mismatch, this paper proposes a multicore architecture, referred to as Voltron, that extends traditional multicore systems in two ways. First, it provides a dual-mode scalar operand network to enable efficient inter-core communication and lightweight synchronization. Second, Voltron can organize the cores for execution in either coupled or decoupled mode. In coupled mode, the cores execute multiple instruction streams in lock-step to collectively function as a wide-issue VLIW. In decoupled mode, the cores execute a set of fine-grain communicating threads extracted by the compiler. This paper describes the Voltron architecture and associated compiler support for orchestrating bi-modal execution.
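Decoupled mode can be pictured as fine-grain threads exchanging scalar operands over shallow hardware channels. The Python sketch below stands in for that behavior with a bounded queue; it is an analogy for the scalar operand network, not Voltron's actual interface:

```python
from queue import Queue
from threading import Thread

# A shallow queue plays the role of one channel of the dual-mode
# scalar operand network connecting two cores in decoupled mode.
operand_net = Queue(maxsize=4)

def core0(values):
    for v in values:
        operand_net.put(v * v)               # produce a scalar operand

def core1(n, out):
    for _ in range(n):
        out.append(operand_net.get() + 1)    # consume; blocking get acts
                                             # as lightweight synchronization

results = []
data = [1, 2, 3, 4]
t0 = Thread(target=core0, args=(data,))
t1 = Thread(target=core1, args=(len(data), results))
t0.start(); t1.start(); t0.join(); t1.join()
print(results)   # [2, 5, 10, 17]
```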
Cited by: 117
Evaluating MapReduce for Multi-core and Multiprocessor Systems
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346181
Colby Ranger, R. Raghuraman, Arun Penmetsa, G. Bradski, C. Kozyrakis
This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems. MapReduce was created by Google for application development on data-centers with thousands of servers. It allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system. We describe Phoenix, an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system. The Phoenix runtime automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes. We study Phoenix with multi-core and symmetric multiprocessor systems and evaluate its performance potential and error recovery features. We also compare MapReduce code to code written in lower-level APIs such as Pthreads. Overall, we establish that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
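Phoenix itself exposes a C API, but the programming model is easy to sketch: the runtime partitions the input, runs map tasks on a pool of threads over shared memory, groups intermediate pairs by key, and then runs reduce tasks. A minimal word count in that style (an illustration of the model, not the Phoenix API):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_task(chunk):
    # Emit (key, value) pairs for one input partition.
    return [(word, 1) for word in chunk.split()]

def reduce_task(key, values):
    return key, sum(values)

def mapreduce(chunks, workers=4):
    groups = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: tasks are scheduled dynamically over shared memory.
        for pairs in pool.map(map_task, chunks):
            for k, v in pairs:              # shuffle: group by key
                groups[k].append(v)
        # Reduce phase.
        return dict(pool.map(lambda kv: reduce_task(*kv), groups.items()))

print(mapreduce(["a rose is a rose", "is a rose"]))
# {'a': 3, 'rose': 3, 'is': 2}
```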
Cited by: 1092
Modeling and Managing Thermal Profiles of Rack-mounted Servers with ThermoStat
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346198
Jeonghwan Choi, Youngjae Kim, A. Sivasubramaniam, J. Srebric, Qian Wang, Joonwon Lee
High power densities and the implications of high operating temperatures on the failure rates of components are key driving factors of temperature-aware computing. Computer architects and system software designers need to understand the thermal consequences of their proposals, and develop techniques to lower operating temperatures to reduce both transient and permanent component failures. Tools for understanding the temperature ramifications of designs have been mainly restricted to industry for studying packaging and cooling mechanisms, with little access to such toolsets for academic researchers. Developing such tools is an arduous task since it usually requires cross-cutting areas of expertise spanning architecture, systems software, thermodynamics, and cooling systems. Recognizing the need for such tools, there has been work on modeling temperatures of processors at the micro-architectural level, which can be easily understood and employed by computer architects for processor designs. However, there is a dearth of such tools in the academic/research community for undertaking architectural/systems studies beyond a processor - a server box, rack, or even a machine room. This paper presents a detailed 3-dimensional computational fluid dynamics based thermal modeling tool, called ThermoStat, for rack-mounted server systems. Using this tool, we model a 20-node rack-mounted server system (each node with dual Xeon processors), and validate it with over 30 temperature sensor measurements at different points in the servers/rack. We conduct several experiments with this tool to show how different load conditions affect the thermal profile, and also illustrate how this tool can help design dynamic thermal management techniques.
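ThermoStat solves full 3-D computational fluid dynamics, which is far beyond a snippet; as a toy stand-in for the kind of energy balance such tools enforce, the sketch below estimates a rack's steady-state outlet air temperature from inlet temperature, power draw, and airflow, using the standard sensible-heat formula with illustrative numbers:

```python
# Toy steady-state energy balance (not ThermoStat's CFD): outlet air
# temperature of a rack given inlet temperature, power, and airflow.
RHO_AIR = 1.2      # air density, kg/m^3
CP_AIR = 1005.0    # specific heat of air, J/(kg*K)

def outlet_temp_c(inlet_c: float, power_w: float, airflow_m3_s: float) -> float:
    # T_out = T_in + P / (rho * cp * Q)
    return inlet_c + power_w / (RHO_AIR * CP_AIR * airflow_m3_s)

# Illustrative numbers: a 10 kW rack fed 25 C air at 0.5 m^3/s.
print(f"{outlet_temp_c(25.0, 10_000.0, 0.5):.1f} C")   # ~41.6 C
```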
Cited by: 59
Optical Interconnect Opportunities for Future Server Memory Systems
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346184
Y. Katayama, A. Okazaki
This paper deals with alternative server memory architecture options in multicore CPU generations using optically-attached memory systems. Thanks to its large bandwidth-distance product, optical interconnect technology enables CPUs and local memory to be placed meters away from each other without sacrificing bandwidth. This topologically-local but physically-remote main memory, attached via an ultra-high-bandwidth parallel optical interconnect, can lead to flexible memory architecture options using low-cost commodity memory technologies.
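A quick worked example of why "meters away" is tolerable: round-trip time of flight in fiber grows by only about 10 ns per meter of separation, small compared with a typical DRAM access. The numbers below are illustrative assumptions:

```python
# Time of flight over fiber vs. distance; light in fiber travels at
# roughly two-thirds the speed of light in vacuum.
C_FIBER_M_PER_S = 2.0e8

def round_trip_ns(distance_m: float) -> float:
    return 2 * distance_m / C_FIBER_M_PER_S * 1e9

for d in (0.1, 2.0, 10.0):
    print(f"{d:>5.1f} m -> {round_trip_ns(d):5.1f} ns round trip")
# 2 m of separation adds ~20 ns of flight time, a fraction of a
# typical ~50 ns DRAM access, while bandwidth is unaffected.
```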
Cited by: 21