
Latest publications in Proceedings. International Symposium on Computer Architecture

Simultaneous speculative threading: a novel pipeline architecture implemented in sun's rock processor
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555814
Shailender Chaudhry, R. Cypher, M. Ekman, Martin Karlsson, A. Landin, S. Yip, Håkan Zeffer, M. Tremblay
This paper presents Simultaneous Speculative Threading (SST), which is a technique for creating high-performance area- and power-efficient cores for chip multiprocessors. SST hardware dynamically extracts two threads of execution from a single sequential program (one consisting of a load miss and its dependents, and the other consisting of the instructions that are independent of the load miss) and executes them in parallel. SST uses an efficient checkpointing mechanism to eliminate the need for complex and power-inefficient structures such as register renaming logic, reorder buffers, memory disambiguation buffers, and large issue windows. Simulations of certain SST implementations show 18% better per-thread performance on commercial benchmarks than larger and higher-powered out-of-order cores. Sun Microsystems' ROCK processor, which is the first processor to use SST cores, has been implemented and is scheduled to be commercially available in 2009.
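The two-thread split the abstract describes (a load miss and its dependents deferred behind a checkpoint, while independent instructions run ahead) can be illustrated with a toy software model. The function, instruction encoding, and dependence sets below are hypothetical illustrations, not the ROCK pipeline:

```python
def sst_execute(instrs, miss_loads):
    """Toy SST sketch. instrs: list of (name, deps) in program order.
    Instructions that are (transitively) dependent on a missing load are
    deferred; independent instructions execute ahead. Returns the
    resulting execution order: ahead thread first, deferred thread after
    the miss data returns."""
    deferred_names = set()
    ahead, deferred = [], []
    for name, deps in instrs:
        if name in miss_loads or deferred_names & set(deps):
            deferred_names.add(name)   # miss or dependent on a deferred op
            deferred.append(name)
        else:
            ahead.append(name)         # independent work runs ahead
    return ahead + deferred
```

Under this simplification, a missing load and its dependent chain drain after the independent instructions, mirroring the two extracted threads.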
Citations: 66
Architecting phase change memory as a scalable dram alternative
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555758
Benjamin C. Lee, Engin Ipek, O. Mutlu, D. Burger
Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as DRAM. In contrast, phase change memory (PCM) storage relies on scalable current and thermal mechanisms. To exploit PCM's scalability as a DRAM alternative, PCM must be architected to address relatively long latencies, high energy writes, and finite endurance. We propose, crafted from a fundamental understanding of PCM technology parameters, area-neutral architectural enhancements that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. Partial writes enhance memory endurance, providing 5.6 years of lifetime. Process scaling will further reduce PCM energy costs and improve endurance.
Citations: 1491
Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555788
Hongzhong Zheng, Jiang Lin, Zhao Zhang, Zhichun Zhu
The widespread use of multicore processors has dramatically increased the demands on high bandwidth and large capacity from memory systems. In a conventional DDR2/DDR3 DRAM memory system, the memory bus and DRAM devices run at the same data rate. To improve memory bandwidth, we propose a new memory system design called decoupled DIMM that allows the memory bus to operate at a data rate much higher than that of the DRAM devices. In the design, a synchronization buffer is added to relay data between the slow DRAM devices and the fast memory bus; and memory access scheduling is revised to avoid access conflicts on memory ranks. The design not only improves memory bandwidth beyond what can be supported by current memory devices, but also improves reliability, power efficiency, and cost effectiveness by using relatively slow memory devices. The idea of decoupling, precisely the decoupling of bandwidth match between memory bus and a single rank of devices, can also be applied to other types of memory systems including FB-DIMM. Our experimental results show that a decoupled DIMM system of 2667MT/s bus data rate and 1333MT/s device data rate improves the performance of memory-intensive workloads by 51% on average over a conventional memory system of 1333MT/s data rate. Alternatively, a decoupled DIMM system of 1600MT/s bus data rate and 800MT/s device data rate incurs only 8% performance loss when compared with a conventional system of 1600MT/s data rate, with 16% reduction on the memory power consumption and 9% saving on memory energy.
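The arithmetic behind the bus/device pairings in the abstract (2667MT/s bus over 1333MT/s devices, or 1600 over 800) can be sketched as a back-of-envelope calculation; the functions and default bus width below are assumptions, not from the paper:

```python
def peak_bandwidth_gbs(data_rate_mts, bus_width_bytes=8):
    """Peak bandwidth of a 64-bit (8-byte-wide) DDR channel in GB/s.
    Toy calculation; ignores command, refresh, and turnaround overheads."""
    return data_rate_mts * bus_width_bytes / 1000.0

def ranks_to_match_bus(bus_rate_mts, device_rate_mts):
    """Idealized number of DRAM ranks the synchronization buffer must
    interleave so the slow devices can keep the fast bus busy (assumes
    conflict-free scheduling across ranks)."""
    return round(bus_rate_mts / device_rate_mts)
```

In this idealized view, a bus at twice the device data rate needs two ranks relayed through the synchronization buffer to sustain its peak rate.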
Citations: 61
Scaling the bandwidth wall: challenges in and avenues for CMP scaling
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555801
Brian Rogers, A. Krishna, Gordon B. Bell, K. V. Vu, Xiaowei Jiang, Yan Solihin
As transistor density continues to grow at an exponential rate in accordance to Moore's law, the goal for many Chip Multi-Processor (CMP) systems is to scale the number of on-chip cores proportionally. Unfortunately, off-chip memory bandwidth capacity is projected to grow slowly compared to the desired growth in the number of cores. This creates a situation in which each core will have a decreasing amount of off-chip bandwidth that it can use to load its data from off-chip memory. The situation in which off-chip bandwidth is becoming a performance and throughput bottleneck is referred to as the bandwidth wall problem. In this study, we seek to answer two questions: (1) to what extent does the bandwidth wall problem restrict future multicore scaling, and (2) to what extent are various bandwidth conservation techniques able to mitigate this problem. To address them, we develop a simple but powerful analytical model to predict the number of on-chip cores that a CMP can support given a limited growth in memory traffic capacity. We find that the bandwidth wall can severely limit core scaling. When starting with a balanced 8-core CMP, in four technology generations the number of cores can only scale to 24, as opposed to 128 cores under proportional scaling, without increasing the memory traffic requirement. We find that various individual bandwidth conservation techniques we evaluate have a wide ranging impact on core scaling, and when combined together, these techniques have the potential to enable super-proportional core scaling for up to 4 technology generations.
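The abstract's headline numbers (8 cores growing to only 24 over four generations, versus 128 under proportional scaling) follow from a simple growth model. The sketch below is a toy version, not the paper's actual analytical model; the ~32%-per-generation bandwidth growth factor is inferred from those numbers:

```python
def max_cores(base_cores, bw_growth_per_gen, generations):
    """Toy bandwidth-wall model: if per-core off-chip traffic stays
    constant, the supportable core count can grow only as fast as total
    memory bandwidth (hypothetical simplification of the paper's model)."""
    return round(base_cores * bw_growth_per_gen ** generations)
```

Proportional scaling doubles cores each generation (8 to 128 in four), but if bandwidth grows only by a factor of 3 over four generations (about 1.32x per generation), the same die supports just 24 cores.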
Citations: 305
Spatio-temporal memory streaming
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555766
Stephen Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi
Recent research advocates memory streaming techniques to alleviate the performance bottleneck caused by the high latencies of off-chip memory accesses. Temporal memory streaming replays previously observed miss sequences to eliminate long chains of dependent misses. Spatial memory streaming predicts repetitive data layout patterns within fixed-size memory regions. Because each technique targets a different subset of misses, their effectiveness varies across workloads and each leaves a significant fraction of misses unpredicted. In this paper, we propose Spatio-Temporal Memory Streaming (STeMS) to exploit the synergy between spatial and temporal streaming. We observe that the order of spatial accesses repeats both within and across regions. STeMS records and replays the temporal sequence of region accesses and uses spatial relationships within each region to dynamically reconstruct a predicted total miss order. Using trace-driven and cycle-accurate simulation across a suite of commercial workloads, we demonstrate that with similar implementation complexity as temporal streaming, STeMS achieves equal or higher coverage than spatial or temporal memory streaming alone, and improves performance by 31%, 3%, and 18% over stride, spatial, and temporal prediction, respectively.
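The record-and-replay idea (a temporal sequence of region accesses plus a spatial footprint per region) can be sketched as follows. This is a hypothetical simplification: it groups all of a region's offsets together on replay, whereas STeMS reconstructs a fully interleaved miss order. The region size is assumed:

```python
REGION = 4096  # assumed fixed region size in bytes

def record(trace):
    """Record a miss-address trace as: regions in first-touch temporal
    order (dict preserves insertion order in Python >= 3.7), with the
    spatial offsets touched inside each region."""
    seq = {}
    for addr in trace:
        base, off = addr - addr % REGION, addr % REGION
        seq.setdefault(base, []).append(off)
    return seq

def replay(seq):
    """Reconstruct a predicted miss order: regions in temporal order,
    offsets in the spatial order observed within each region."""
    return [base + off for base, offs in seq.items() for off in offs]
```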
Citations: 139
A durable and energy efficient main memory using phase change memory technology
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555759
Ping Zhou, Bo Zhao, Jun Yang, Youtao Zhang
Using nonvolatile memories in memory hierarchy has been investigated to reduce its energy consumption because nonvolatile memories consume zero leakage power in memory cells. One of the difficulties is, however, that the endurance of most nonvolatile memory technologies is much shorter than the conventional SRAM and DRAM technology. This has limited its usage to only the low levels of a memory hierarchy, e.g., disks, that is far from the CPU. In this paper, we study the use of a new type of nonvolatile memories -- the Phase Change Memory (PCM) as the main memory for a 3D stacked chip. The main challenges we face are the limited PCM endurance, longer access latencies, and higher dynamic power compared to the conventional DRAM technology. We propose techniques to extend the endurance of the PCM to an average of 13 (for MLC PCM cell) to 22 (for SLC PCM) years. We also study the design choices of implementing PCM to achieve the best tradeoff between energy and performance. Our design reduced the total energy of an already low-power DRAM main memory of the same capacity by 65%, and energy-delay2 product by 60%. These results indicate that it is feasible to use PCM technology in place of DRAM in the main memory for better energy efficiency.
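Lifetime figures like the 13 to 22 years above come from endurance arithmetic. A back-of-envelope model, assuming perfect wear-leveling (the function and the parameter values in the test are hypothetical, not the paper's):

```python
def pcm_lifetime_years(capacity_bytes, endurance_writes, write_bw_bytes_per_s):
    """Idealized PCM lifetime: with perfect wear-leveling, every cell
    absorbs an equal share of the write traffic. Real lifetimes also
    depend on write locality and on filtering redundant (partial) writes."""
    total_cell_writes = capacity_bytes * endurance_writes
    seconds = total_cell_writes / write_bw_bytes_per_s
    return seconds / (365 * 24 * 3600)
```

For example, a 32GB PCM main memory with 1e8 writes-per-cell endurance sustaining 1GB/s of writes would last roughly a century under this ideal model; realistic write locality without wear-leveling shortens that dramatically.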
Citations: 944
SigRace: signature-based data race detection
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555797
A. Muzahid, D. S. Gracia, Shanxiang Qi, J. Torrellas
Detecting data races in parallel programs is important for both software development and production-run diagnosis. Recently, there have been several proposals for hardware-assisted data race detection. Such proposals typically modify the L1 cache and cache coherence protocol messages, and largely lose their capability when lines get displaced or invalidated from the cache. To eliminate these shortcomings, this paper proposes a novel, different approach to hardware-assisted data race detection. The approach, called SigRace, relies on hardware address signatures. As a processor runs, the addresses of the data that it accesses are automatically encoded in signatures. At certain times, the signatures are automatically passed to a hardware module that intersects them with those of other processors. If the intersection is not null, a data race may have occurred. This paper presents the architecture of SigRace, an implementation, and its software interface. With SigRace, caches and coherence protocol messages are unmodified. Moreover, cache lines can be displaced and invalidated with no effect. Our experiments show that SigRace is significantly more effective than a state-of-the-art conventional hardware-assisted race detector. SigRace finds on average 29% more static races and 107% more dynamic races. Moreover, if we inject data races, SigRace finds 150% more static races than the conventional scheme.
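The address signatures the abstract describes behave like Bloom filters: addresses are hashed into a fixed-width bit vector, and a non-empty intersection between two threads' signatures flags a possible race. A software sketch (the hash seeds and signature width are assumptions; the paper implements signatures as fixed hardware hash networks):

```python
class AddressSignature:
    """Bloom-filter-style address signature, as a software sketch."""

    def __init__(self, bits=1024):
        self.bits = bits
        self.sig = 0  # the signature is a single bit vector

    def add(self, addr):
        # Set one bit per hash function (two assumed multiplicative hashes,
        # taking the high bits of a 32-bit product).
        for seed in (0x9E3779B9, 0x85EBCA6B):
            h = ((addr * seed) & 0xFFFFFFFF) * self.bits >> 32
            self.sig |= 1 << h

    def intersects(self, other):
        # Non-empty intersection: the two threads MAY share an address.
        # False positives are possible; encoded accesses are never missed.
        return (self.sig & other.sig) != 0
```

Intersection is a single AND over bit vectors, which is why the check is cheap enough to run on signature exchanges rather than per cache-line event.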
Citations: 97
A case for an interleaving constrained shared-memory multi-processor
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555796
Jie Yu, S. Narayanasamy
Shared-memory multi-threaded programming is inherently more difficult than single-threaded programming. The main source of complexity is that, the threads of an application can interleave in so many different ways. To ensure correctness, a programmer has to test all possible thread interleavings, which, however, is impractical. Many rare thread interleavings remain untested in production systems, and they are the root cause for a majority of concurrency bugs. We propose a shared-memory multi-processor design that avoids untested interleavings to improve the correctness of a multi-threaded program. Since untested interleavings tend to occur infrequently at runtime, the performance cost of avoiding them is not high. We propose to encode the set of tested correct interleavings in a program's binary executable using Predecessor Set (PSet) constraints. These constraints are efficiently enforced at runtime using processor support, which ensures that the runtime follows a tested interleaving. We analyze several bugs in open source applications such as MySQL, Apache, Mozilla, etc., and show that, by enforcing PSet constraints, we can avoid not only data races and atomicity violations, but also other forms of concurrency bugs.
Citations: 180
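The Predecessor Set (PSet) mechanism described in the abstract above can be illustrated with a toy software model. This is a hedged sketch, not the paper's hardware design: the instruction IDs, the encoding of PSets as a Python dict, and the boolean return value (standing in for the processor stalling or flagging an untested interleaving) are all illustrative assumptions.

```python
# Toy model of PSet constraint checking: each memory instruction carries a
# set of remote instructions that were observed as its immediate predecessor
# during testing; any other remote predecessor is an untested interleaving.

class PSetChecker:
    """Tracks, per shared location, the last memory instruction that touched
    it, and validates each access against the tested-predecessor set of the
    current instruction."""

    def __init__(self, psets):
        # psets: dict mapping instruction id -> set of instruction ids that
        # were observed (tested) as its immediate remote predecessor.
        self.psets = psets
        self.last_access = {}  # location -> (thread_id, instr_id)

    def access(self, thread_id, instr_id, location):
        """Return True if this access follows a tested interleaving."""
        prev = self.last_access.get(location)
        ok = True
        if prev is not None and prev[0] != thread_id:
            # Remote predecessor: it must appear in this instruction's PSet.
            ok = prev[1] in self.psets.get(instr_id, set())
        self.last_access[location] = (thread_id, instr_id)
        return ok
```

For example, if testing only ever saw thread 0's write `I1` immediately before thread 1's read `I2` on location `x`, then a different remote predecessor arriving at an instruction with an empty PSet would be flagged as untested.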
Architectural core salvaging in a multi-core processor for hard-error tolerance
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555769
Michael D. Powell, Arijit Biswas, S. Gupta, Shubhendu S. Mukherjee
The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture-time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately, microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core. This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is, execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults, and be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die--assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.
{"title":"Architectural core salvaging in a multi-core processor for hard-error tolerance","authors":"Michael D. Powell, Arijit Biswas, S. Gupta, Shubhendu S. Mukherjee","doi":"10.1145/1555754.1555769","DOIUrl":"https://doi.org/10.1145/1555754.1555769","url":null,"abstract":"The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture-time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core.\u0000 This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults, and be implemented by leveraging known techniques that minimize changes to the microarchitecture. 
We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die--assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"126 1","pages":"93-104"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73004286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 137
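The migrate-on-unsupported-operation idea in the abstract above can be sketched as a small dispatcher. This is a hedged toy model under stated assumptions: the `Core` class, the opcode names, and the software scheduler function are invented for illustration; the actual proposal performs the migration in hardware.

```python
# Toy model of architectural core salvaging: a thread hitting an operation
# its core cannot execute (due to a hard fault in a functional unit) is
# migrated to a core that can, keeping the die as a whole ISA-compliant.

class Core:
    def __init__(self, core_id, broken_units=()):
        self.core_id = core_id
        # Opcodes this core cannot execute, e.g. a faulty FP divider.
        self.broken_units = set(broken_units)

    def can_execute(self, opcode):
        return opcode not in self.broken_units

def run_instruction(cores, current, opcode):
    """Execute opcode on core `current`, migrating to a capable core if the
    current one has a hard fault in the needed unit. Returns the id of the
    core that actually executed the instruction."""
    if cores[current].can_execute(opcode):
        return current
    for core in cores:
        if core.can_execute(opcode):
            return core.core_id
    # The die is ISA-compliant only if every opcode runs on some core.
    raise RuntimeError("no core on the die can execute " + opcode)
```

Common instructions run in place; only the rare offending opcode pays the migration cost, which is why per-die performance can approach that of a fault-free die.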
ECMon: exposing cache events for monitoring
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555798
V. Nagarajan, Rajiv Gupta
The advent of multicores has introduced new challenges for programmers to provide increased performance and software reliability. There has been significant interest in techniques that use software speculation to better utilize the computational power of multicores. At the same time, several recent proposals for ensuring software reliability are not applicable in a multicore setting due to their inability to handle interprocessor shared memory dependences (ISMDs). The demands for performing speculation and ensuring software reliability in a multicore setting, although seemingly different, share a common requirement: the need for monitoring program execution and collecting interprocessor dependence information at low overhead. For example, an important component of speculation is the efficient detection of mis-speculation, which in turn requires dependence information. Likewise, tasks that help ensure software reliability on multicores, including recording for replay, require ISMD information. In this paper, we propose ECMon: support for exposing cache events to the software. This enables the programmer to catch these events and react to them; in effect, efficiently exposing the ISMDs to the programmer. In the context of speculation, we show how ECMon optimizes the detection of mis-speculation; we use this simple support to speculate past active barriers and achieve a speedup of 12% for the set of parallel programs considered. As an application of ensuring software reliability, we show how ECMon can be used to record shared memory dependences on multicores using no specialized hardware support at only 2.8-fold execution time overhead.
{"title":"ECMon: exposing cache events for monitoring","authors":"V. Nagarajan, Rajiv Gupta","doi":"10.1145/1555754.1555798","DOIUrl":"https://doi.org/10.1145/1555754.1555798","url":null,"abstract":"The advent of multicores has introduced new challenges for programmers to provide increased performance and software reliability. There has been significant interest in techniques that use software speculation to better utilize the computational power of multicores. At the same time, several recent proposals for ensuring software reliability are not applicable in a multicore setting due to their inability to handle interprocessor shared memory dependences (ISMDs). The demands for performing speculation and ensuring software reliability in a multicore setting, although seemingly different, share a common requirement: the need for monitoring program execution and collecting interprocessor dependence information at low overhead. For example, an important component of speculation is the effcient detection of missspeculation which in turn requires dependence information. Likewise, tasks that help ensure software reliability on multicores, including recording for replay, require ISMD information.\u0000 In this paper, we propose ECMon: support for exposing cache events to the software. This enables the programmer to catch these events and react to them; in effect, efficiently exposing the ISMDs to the programmer. In the context of speculation, we show how ECMon optimizes the detection of miss-speculation; we use this simple support to speculate past active barriers and achieve a speedup of 12% for the set of parallel programs considered. As an application of ensuring software reliability, we show how ECMon can be used to record shared memory dependences on multicores using no specialized hardware support at only 2.8 fold execution time overhead.","PeriodicalId":91388,"journal":{"name":"Proceedings. 
International Symposium on Computer Architecture","volume":"58 1","pages":"349-360"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80601153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
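The cache-event-exposure idea in the abstract above can be modeled in software as a coherence-event callback. This is a hedged sketch: the event names ("downgrade", "invalidate") and the handler-registration API are assumptions made for illustration; ECMon's actual mechanism exposes hardware cache events via processor support rather than a software monitor.

```python
# Toy model of exposing cache events: a handler registered by software fires
# whenever a coherence event reveals an interprocessor shared-memory
# dependence (ISMD), e.g. for recording a replay log.

class CacheEventMonitor:
    def __init__(self):
        self.owner = {}      # cache line -> last writer core
        self.handlers = []   # callbacks fired on remote-dependence events

    def register(self, handler):
        self.handlers.append(handler)

    def write(self, core, line):
        prev = self.owner.get(line)
        if prev is not None and prev != core:
            # A remote write invalidates the previous owner's copy.
            self._fire(("invalidate", prev, core, line))
        self.owner[line] = core

    def read(self, core, line):
        prev = self.owner.get(line)
        if prev is not None and prev != core:
            # The reader depends on the last remote writer: an ISMD.
            self._fire(("downgrade", prev, core, line))

    def _fire(self, event):
        for handler in self.handlers:
            handler(event)
```

A replay recorder would simply register a handler that appends each event to a log; local accesses generate no events, which is why the monitoring overhead stays low.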