
2007 IEEE 13th International Symposium on High Performance Computer Architecture: Latest Publications

Improving Branch Prediction and Predicated Execution in Out-of-Order Processors
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346186
E. Quiñones, Joan-Manuel Parcerisa, Antonio González
If-conversion is a compiler technique that reduces the misprediction penalties caused by hard-to-predict branches, transforming control dependencies into data dependencies. Although it is globally beneficial, it has a negative side-effect because the removal of branches eliminates useful correlation information necessary for conventional branch predictors. The remaining branches may become harder to predict. However, in predicated ISAs with a compare-branch model, the correlation information not only resides in branches, but also in compare instructions that compute their guarding predicates. When a branch is removed, its correlation information is still available in its compare instruction. We propose a branch prediction scheme based on predicate prediction. It has three advantages: First, since the prediction is not done on a branch basis but on a predicate define basis, branch removal after if-conversion does not lose any correlation information, so accuracy is not degraded. Second, the mechanism we propose permits using the computed value of the branch predicate when available, instead of the predicted value, thus effectively achieving 100% accuracy on such early-resolved branches. Third, as shown in previous work, the selective predicate prediction is a very effective technique to implement if-conversion on out-of-order processors, since it avoids the problem of multiple register definitions and reduces the unnecessary resource consumption of nullified instructions. Hence, our approach enables a very efficient implementation of if-conversion for an out-of-order processor, with almost no additional hardware cost, because the same hardware is used to predict the predicates of if-converted code and to predict branches without accuracy degradation.
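To make the control-to-data transformation concrete, here is a minimal sketch (not from the paper) of what if-conversion does: the compare computes a guarding predicate, and both arms become predicated operations selected by that predicate rather than by a branch.

```python
# A toy model of if-conversion. In a predicated ISA the two guarded
# assignments would be real instructions nullified by the predicate;
# here the predicate simply selects between the two computed values.

def abs_branchy(x):
    if x < 0:                 # control dependency: a hard-to-predict branch
        return -x
    return x

def abs_if_converted(x):
    p = x < 0                 # the compare instruction computes the predicate
    neg = -x                  # (p)  guarded operation
    pos = x                   # (!p) guarded operation
    return neg if p else pos  # data dependency on p; no branch to mispredict

for x in (-3, 0, 7):
    assert abs_branchy(x) == abs_if_converted(x)
```

Note that, per the abstract, the correlation information survives in the compare that computes `p`, which is exactly what the proposed predicate predictor exploits.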
Citations: 17
Exploiting Postdominance for Speculative Parallelization
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346207
Mayank Agarwal, Kshitiz Malik, Kevin M. Woley, S. S. Stone, M. Frank
Task-selection policies are critical to the performance of any architecture that uses speculation to extract parallel tasks from a sequential thread. This paper demonstrates that the immediate postdominators of conditional branches provide a larger set of parallel tasks than existing task-selection heuristics, which are limited to programming language constructs (such as loops or procedure calls). Our evaluation shows that postdominance-based task selection achieves, on average, more than double the speedup of the best individual heuristic, and 33% more speedup than the best combination of heuristics. The specific contributions of this paper include, first, a description of task selection based on immediate post-dominance for a system that speculatively creates tasks. Second, our experimental evaluation demonstrates that existing task-selection heuristics based on loops, procedure calls, and if-else statements are all subsumed by compiler-generated immediate postdominators. Finally, by demonstrating that dynamic reconvergence prediction closely approximates immediate postdominator analysis, we show that the notion of immediate postdominators may also be useful in constructing dynamic task selection mechanisms.
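As a concrete illustration, the sketch below (the CFG and node names are hypothetical, not from the paper) computes immediate postdominators for a small if-else diamond using the standard iterative dataflow formulation; the immediate postdominator of a branch is its reconvergence point, i.e., where a new speculative task could begin.

```python
# Immediate postdominators for a toy CFG (an if-else diamond).
# pdom(n) = {n} U intersection of pdom(s) over successors s of n,
# iterated to a fixed point on the reversed dominance problem.

succ = {0: [1, 2], 1: [3], 2: [3], 3: []}   # node 3 is EXIT
nodes, EXIT = list(succ), 3

pdom = {n: set(nodes) for n in nodes}
pdom[EXIT] = {EXIT}
changed = True
while changed:
    changed = False
    for n in nodes:
        if n == EXIT:
            continue
        new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
        if new != pdom[n]:
            pdom[n], changed = new, True

def ipdom(n):
    strict = pdom[n] - {n}
    # The nearest strict postdominator is itself postdominated by all
    # the others, so it owns the largest postdominator set.
    return max(strict, key=lambda d: len(pdom[d]))

print(ipdom(0))   # -> 3: the branch at node 0 reconverges at node 3,
                  # so node 3 marks where a speculative task could start
```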
Citations: 23
Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346202
Moinuddin K. Qureshi, M. A. Suleman, Y. Patt
Caches are organized at a line-size granularity to exploit spatial locality. However, when spatial locality is low, many words in the cache line are not used. Unused words occupy cache space but do not contribute to cache hits. Filtering these words can allow the cache to store more cache lines. We show that unused words in a cache line are unlikely to be accessed in the less recent part of the LRU stack. We propose line distillation (LDIS), a technique that retains only the used words and evicts the unused words in a cache line. We also propose distill cache, a cache organization to utilize the capacity created by LDIS. Our experiments with 16 memory-intensive benchmarks show that LDIS reduces the average misses for a 1MB 8-way L2 cache by 30% and improves the average IPC by 12%.
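The core bookkeeping is simple to sketch (all names below are illustrative, not the paper's hardware): track a per-word footprint bit for each cache line, and on eviction retain only the words that were actually touched.

```python
# Per-word usage tracking for one cache line. On eviction, distill()
# keeps only the touched words; with low spatial locality most words
# go unused and can be filtered to make room for more (partial) lines.

WORDS_PER_LINE = 8

class CacheLine:
    def __init__(self, tag):
        self.tag = tag
        self.used = [False] * WORDS_PER_LINE   # footprint bits

    def access(self, word_index):
        self.used[word_index] = True

    def distill(self):
        return {w for w, u in enumerate(self.used) if u}

line = CacheLine(tag=0x40)
for w in (0, 1, 5):            # only 3 of the 8 words are ever touched
    line.access(w)
print(line.distill())          # -> {0, 1, 5}; the other 5 words are filtered
```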
Citations: 84
A Domain-Specific On-Chip Network Design for Large Scale Cache Systems
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346209
Yuho Jin, Eun Jung Kim, K. H. Yum
As circuit integration technology advances, the design of efficient interconnects has become critical. On-chip networks have been adopted to overcome the scalability and poor resource-sharing problems of shared buses or dedicated wires. However, using a general on-chip network for a specific domain may cause underutilization of the network resources and huge network delays because the interconnects are not optimized for the domain. Addressing these two issues is challenging because in-depth knowledge of both interconnects and the specific domain is required. Non-uniform cache architectures (NUCAs) use wormhole-routed 2D mesh networks to improve the performance of on-chip L2 caches. We observe that network resources in NUCAs are underutilized and occupy considerable chip area (52% of cache area). The network delay is also significant (63% of cache access time). Motivated by these observations, we investigate how to optimize cache operations and design the network in large scale cache systems. We propose a single-cycle router architecture that can efficiently support multicasting in on-chip caches. Next, we present fast-LRU replacement, where cache replacement overlaps with data request delivery. Finally we propose a deadlock-free XYX routing algorithm and a new halo network topology to minimize the number of links in the network. Simulation results show that our networked cache system improves the average IPC by 38% over the mesh network design with multicast promotion replacement while using only 23% of the interconnection area. Specifically, multicast fast-LRU replacement improves the average IPC by 20% compared with multicast promotion replacement. A halo topology design additionally improves the average IPC by 18% over a mesh topology.
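For context, the sketch below shows plain deterministic XY routing on a 2D mesh, the baseline style of routing in such NUCA networks (the paper's XYX variant and halo topology are beyond this toy); routing all X hops before any Y hops is what makes the scheme deadlock-free.

```python
# Deterministic XY routing on a 2D mesh: move along X to the destination
# column, then along Y. Ordering the dimensions breaks cyclic channel
# dependences, which is why this routing is deadlock-free.

def xy_route(src, dst):
    (sx, sy), (dx, dy) = src, dst
    hops = []
    while sx != dx:                    # X dimension first
        sx += 1 if dx > sx else -1
        hops.append((sx, sy))
    while sy != dy:                    # then Y dimension
        sy += 1 if dy > sy else -1
        hops.append((sx, sy))
    return hops

print(xy_route((0, 0), (2, 1)))        # -> [(1, 0), (2, 0), (2, 1)]
```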
Citations: 40
Concurrent Direct Network Access for Virtual Machine Monitors
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346208
Jeffrey Shafer, D. Carr, Aravind Menon, S. Rixner, A. Cox, W. Zwaenepoel, Paul Willmann
This paper presents hardware and software mechanisms to enable concurrent direct network access (CDNA) by operating systems running within a virtual machine monitor. In a conventional virtual machine monitor, each operating system running within a virtual machine must access the network through a software-virtualized network interface. These virtual network interfaces are multiplexed in software onto a physical network interface, incurring significant performance overheads. The CDNA architecture improves networking efficiency and performance by dividing the tasks of traffic multiplexing, interrupt delivery, and memory protection between hardware and software in a novel way. The virtual machine monitor delivers interrupts and provides protection between virtual machines, while the network interface performs multiplexing of the network data. In effect, the CDNA architecture provides the abstraction that each virtual machine is connected directly to its own network interface. Through the use of CDNA, many of the bottlenecks imposed by software multiplexing can be eliminated without sacrificing protection, producing substantial efficiency improvements.
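The division of labor can be sketched abstractly (the classes and names below are invented for illustration, not the paper's interfaces): a conventional VMM funnels every VM's traffic through one software path, whereas a CDNA-style NIC gives each VM its own hardware context and multiplexes in hardware.

```python
# Toy contrast between a software-multiplexed virtual NIC and a
# CDNA-style NIC with per-VM hardware contexts.

class SoftwareMuxNIC:
    """Conventional VMM path: every VM's packets funnel through one
    software-managed queue, so the VMM touches every packet."""
    def __init__(self):
        self.shared_queue = []

    def send(self, vm_id, pkt):
        self.shared_queue.append((vm_id, pkt))   # VMM multiplexes in software


class CDNANic:
    """CDNA-style NIC: one context per VM; the NIC itself multiplexes,
    so the VMM stays off the data path."""
    def __init__(self, num_vms):
        self.contexts = {vm: [] for vm in range(num_vms)}

    def send(self, vm_id, pkt):
        self.contexts[vm_id].append(pkt)         # direct, per-VM context


nic = CDNANic(num_vms=2)
nic.send(0, "pkt-a")
nic.send(1, "pkt-b")
print(nic.contexts)   # each VM's traffic lands in its own context
```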
Citations: 165
LogTM-SE: Decoupling Hardware Transactional Memory from Caches
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346204
Luke Yen, J. Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, M. Hill, M. Swift, D. Wood
This paper proposes a hardware transactional memory (HTM) system called LogTM Signature Edition (LogTM-SE). LogTM-SE uses signatures to summarize a transaction's read- and write-sets and detects conflicts on coherence requests (eager conflict detection). Transactions update memory "in place" after saving the old value in a per-thread memory log (eager version management). Finally, a transaction commits locally by clearing its signature, resetting the log pointer, etc., while aborts must undo the log. LogTM-SE achieves two key benefits. First, signatures and logs can be implemented without changes to highly-optimized cache arrays because LogTM-SE never moves cached data, changes a block's cache state, or flash-clears bits in the cache. Second, transactions are more easily virtualized because signatures and logs are software accessible, allowing the operating system and runtime to save and restore this state. In particular, LogTM-SE allows cache victimization, unbounded nesting (both open and closed), thread context switching and migration, and paging.
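A minimal sketch of the signature idea follows, with sizes and hash functions chosen for illustration rather than taken from the paper. A Bloom-filter summary admits no false negatives, so a false positive merely causes a spurious conflict, never a missed one, and commit reduces to clearing the signature.

```python
# A Bloom-filter-style write-set signature checked on coherence requests.

SIG_BITS = 64

def sig_hashes(addr):
    # Two stand-in hash functions over the block address (real designs
    # use cheap bit permutations in hardware).
    return [hash((addr, i)) % SIG_BITS for i in (0, 1)]

class Signature:
    def __init__(self):
        self.bits = 0

    def insert(self, addr):
        for h in sig_hashes(addr):
            self.bits |= 1 << h

    def might_contain(self, addr):
        # No false negatives; rare false positives are safe (extra conflicts).
        return all((self.bits >> h) & 1 for h in sig_hashes(addr))

    def clear(self):
        self.bits = 0          # commit is cheap: just clear the signature

wset = Signature()
wset.insert(0x1000)                  # transaction writes block 0x1000
print(wset.might_contain(0x1000))    # True  -> incoming request conflicts
print(wset.might_contain(0x2000))    # almost surely False -> no conflict
wset.clear()                         # commit
```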
Citations: 365
Petascale Computing Research Challenges - A Manycore Perspective
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346188
S. Pawlowski
Summary form only given. Future high performance computing will undoubtedly reach Petascale and beyond; today's HPC is tomorrow's personal computing. Which processor architectures are evolving toward multi-core and many-core for the best performance per watt? Which memory bandwidth solutions can feed ever more powerful processors? Which intra-chip interconnect options best balance bandwidth against power? And with Moore's Law continuing to prove its viability as transistor geometries shrink, improving reliability is more challenging still. Intel Senior Fellow and Chief Technology Officer of Intel's Digital Enterprise Group, Steve Pawlowski, will provide his technology vision, insight, and research challenges on the road to Petascale computing and beyond.
Citations: 2
HARD: Hardware-Assisted Lockset-based Race Detection
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346191
Pin Zhou, R. Teodorescu, Yuanyuan Zhou
The emergence of multicore architectures will lead to an increase in the use of multithreaded applications that are prone to synchronization bugs, such as data races. Software solutions for detecting data races generally incur large overheads. Hardware support for race detection can significantly reduce that overhead. However, all existing hardware proposals for race detection are based on the happens-before algorithm, which is sensitive to thread interleaving and cannot detect races that are not exposed during the monitored run. The lockset algorithm addresses this limitation. Unfortunately, due to challenging issues such as storing the lockset information and performing complex set operations, so far it has been implemented only in software, with a 10-30x performance hit. This paper proposes the first hardware implementation (called HARD) of the lockset algorithm, to exploit the race detection capability of this algorithm with minimal overhead. HARD efficiently stores lock sets in hardware Bloom filters and converts the expensive set operations into fast bitwise logic operations with negligible overhead. We evaluate HARD using six SPLASH-2 applications with 60 randomly injected bugs. Our results show that HARD can detect 54 out of 60 tested bugs, 20% more than happens-before, with only 0.1-2.6% execution overhead. We also show our hardware design is cost-effective by comparing with an ideal lockset implementation, which would require a large amount of hardware resources.
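For reference, the software lockset algorithm that HARD accelerates can be sketched in a few lines (a toy version in the spirit of Eraser; variable and lock names are invented): each shared variable's candidate lockset is intersected with the locks held at every access, and an empty result flags a potential race. These set intersections are exactly the expensive operations that HARD turns into bitwise logic on Bloom filters.

```python
# Eraser-style lockset checking in miniature. candidate[v] holds the set
# of locks that has protected *every* access to v so far.

candidate = {}

def on_access(var, locks_held):
    if var not in candidate:
        candidate[var] = set(locks_held)       # first access: all held locks
    else:
        candidate[var] &= locks_held           # the expensive set intersection
    if not candidate[var]:
        print(f"potential data race on {var!r}")

on_access("x", {"L1", "L2"})   # thread A holds L1 and L2
on_access("x", {"L2"})         # thread B holds only L2 -> lockset = {L2}
on_access("x", {"L1"})         # thread C holds only L1 -> empty -> race
```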
Citations: 163
Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346193
A. Meixner, Daniel J. Sorin
To provide high dependability in a multithreaded system despite hardware faults, the system must detect and correct errors in its shared memory system. Recent research has explored dynamic checking of cache coherence as a comprehensive approach to memory system error detection. However, existing coherence checkers are costly to implement, incur high interconnection network traffic overhead, and do not scale well. In this paper, we describe the token coherence signature checker (TCSC), which provides comprehensive, low-cost, scalable coherence checking by maintaining signatures that represent recent histories of coherence events at all nodes (cache and memory controllers). Periodically, these signatures are sent to a verifier to determine if an error occurred. TCSC has a small constant hardware cost per node, independent of cache and memory size and the number of nodes. TCSC's interconnect bandwidth overhead has a constant upper bound and never exceeds 7% in our experiments. TCSC has negligible impact on system performance.
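The checking idea can be sketched abstractly (the event encoding and fold function below are invented, not TCSC's actual signature design): each node folds the coherence messages it sends and receives into a compact signature, and the verifier combines all per-node signatures; a correctly delivered message cancels out, while a lost or corrupted one leaves a nonzero residue.

```python
# Toy signature checking. fold() stands in for the hardware signature
# update; XOR makes a correctly delivered message (folded at both its
# sender and its receiver) cancel at the verifier.

def fold(sig, event):
    return sig ^ hash(event)

nodes = {"cache0": 0, "cache1": 0, "mem": 0}

msg = ("token_transfer", 0x40, "mem", "cache0")   # invented event encoding
nodes["mem"] = fold(nodes["mem"], msg)            # sender records the message
nodes["cache0"] = fold(nodes["cache0"], msg)      # receiver records it too

combined = 0
for sig in nodes.values():
    combined ^= sig

print("error detected" if combined else "histories consistent")
# A message folded at only one endpoint (lost, duplicated, or corrupted
# in flight) would leave a nonzero combined signature at the verifier.
```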
Citations: 34
Colorama: Architectural Support for Data-Centric Synchronization
Pub Date: 2007-02-10 DOI: 10.1109/HPCA.2007.346192
L. Ceze, Pablo Montesinos, C. V. Praun, J. Torrellas
With the advent of ubiquitous multi-core architectures, a major challenge is to simplify parallel programming. One way to tame one of the main sources of programming complexity, namely synchronization, is transactional memory (TM). However, we argue that TM does not go far enough, since the programmer still needs nonlocal reasoning to decide where to place transactions in the code. A significant improvement to the art is data-centric synchronization (DCS), where the programmer uses local reasoning to assign synchronization constraints to data. Based on these, the system automatically infers critical sections and inserts synchronization operations. This paper proposes novel architectural support to make DCS feasible, and describes its programming model and interface. The proposal, called Colorama, needs only modest hardware extensions, supports general-purpose, pointer-based languages such as C/C++ and, in our opinion, can substantially simplify the task of writing new parallel programs.
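A conceptual sketch of the data-centric style follows (the API is invented for illustration, not Colorama's hardware interface): the programmer assigns data a synchronization "color" once, and every access implicitly enters the critical section for that color, so no lock appears at the call site.

```python
import threading

# Data-centric synchronization in miniature: data is "colored" once with
# a lock, and updates implicitly enter that color's critical section.
# The Colored class and update() helper are hypothetical.

class Colored:
    def __init__(self, value, color):
        self.value, self.color = value, color

colors = {"account": threading.Lock()}     # one lock per color

def update(obj, fn):
    with colors[obj.color]:                # system-inferred critical section
        obj.value = fn(obj.value)

balance = Colored(100, "account")
update(balance, lambda v: v - 30)          # no explicit locking by the caller
print(balance.value)                       # -> 70
```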
Citations: 44