
The Ninth International Symposium on High-Performance Computer Architecture, 2003 (HPCA-9 2003), Proceedings: Latest publications

Caches and hash trees for efficient memory integrity verification
B. Gassend, G. Suh, Dwaine E. Clarke, Marten van Dijk, S. Devadas
We study the hardware cost of implementing hash-tree based verification of untrusted external memory by a high performance processor. This verification could enable applications such as certified program execution. A number of schemes are presented with different levels of integration between the on-processor L2 cache and the hash-tree machinery. Simulations show that for the best of our methods, the performance overhead is less than 25%, a significant decrease from the 10× overhead of a naive implementation.
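The hash-tree scheme the abstract refers to can be illustrated with a minimal software Merkle tree: only the root digest lives in trusted storage, and verifying one memory block rehashes a logarithmic path instead of all of memory. This is an illustrative sketch (SHA-256, power-of-two block count), not the paper's cached hardware design, where the internal nodes themselves sit in untrusted memory and are verified or cached on access:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class MerkleTree:
    """Hash tree over fixed-size blocks of untrusted memory.
    Only the root digest is kept in trusted space; verifying one
    block rehashes a log-depth path rather than all of memory.
    Assumes a power-of-two number of blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        level = [h(b) for b in self.blocks]
        self.levels = [level]                 # leaf hashes upward
        while len(level) > 1:
            level = [h(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
            self.levels.append(level)
        self.trusted_root = self.levels[-1][0]

    def verify(self, index: int) -> bool:
        """Recompute the path from blocks[index] to the root and
        compare against the trusted root digest."""
        node = h(self.blocks[index])
        for level in self.levels[:-1]:
            sibling = level[index ^ 1]        # neighbor at this level
            node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
            index //= 2
        return node == self.trusted_root
```

Tampering with any block (without recomputing the path hashes) makes `verify` fail for that block, which is exactly the property the hardware exploits.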
DOI: 10.1109/HPCA.2003.1183547
Citations: 282
Tradeoffs in buffering memory state for thread-level speculation in multiprocessors
M. Garzarán, Milos Prvulović, J. Llabería, V. Viñals, Lawrence Rauchwerger, J. Torrellas
Thread-level speculation provides architectural support to aggressively run hard-to-analyze code in parallel. As speculative tasks run concurrently, they generate unsafe or speculative memory state that needs to be separately buffered and managed in the presence of distributed caches and buffers. Such state may contain multiple versions of the same variable. In this paper, we introduce a novel taxonomy of approaches to buffering and managing multi-version speculative memory state in multiprocessors. We also present a detailed complexity-benefit tradeoff analysis of the different approaches. Finally, we use numerical applications to evaluate the performance of the approaches under a single architectural framework. Our key insights are that support for buffering the state of multiple speculative tasks and versions per processor is more complexity-effective than support for merging the state of tasks with main memory lazily. Moreover, both supports can be gainfully combined and, in large machines, their effect is nearly fully additive. Finally, the more complex support for future state in main memory can boost performance when buffers are under pressure, but hurts performance when squashes are frequent.
DOI: 10.1109/HPCA.2003.1183537
Citations: 55
A statistically rigorous approach for improving simulation methodology
J. Yi, D. Lilja, D. Hawkins
Due to cost, time, and flexibility constraints, simulators are often used to explore the design space when developing new processor architectures, as well as when evaluating the performance of new processor enhancements. However, despite this dependence on simulators, statistically rigorous simulation methodologies are not typically used in computer architecture research. A formal methodology can provide a sound basis for drawing conclusions gathered from simulation results by adding statistical rigor, and consequently, can increase confidence in the simulation results. This paper demonstrates the application of a rigorous statistical technique to the setup and analysis phases of the simulation process. Specifically, we apply a Plackett and Burman design to: (1) identify key processor parameters; (2) classify benchmarks based on how they affect the processor; and (3) analyze the effect of processor performance enhancements. Our technique expands on previous work by applying a statistical method to improve the simulation methodology instead of applying a statistical model to estimate the performance of the processor.
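For readers unfamiliar with Plackett and Burman designs: the idea is a two-level screening matrix whose columns (one per parameter, set "high" or "low") are mutually orthogonal, so each parameter's main effect can be estimated from very few simulation runs. A minimal sketch that builds the power-of-two case from a Sylvester Hadamard matrix; true Plackett-Burman designs also cover other multiples of four via cyclic generators, which this sketch does not implement:

```python
def sylvester_hadamard(n):
    """Build an n x n Hadamard matrix of +/-1 entries by Sylvester
    doubling; n must be a power of two."""
    H = [[1]]
    while len(H) < n:
        H = ([row + row for row in H] +
             [row + [-x for x in row] for row in H])
    return H

def two_level_design(n_runs):
    """Return an n_runs x (n_runs - 1) screening design: drop the
    all-ones column of the Hadamard matrix; each remaining column
    gives the high/low settings of one parameter per run."""
    H = sylvester_hadamard(n_runs)
    return [row[1:] for row in H]
```

With 8 runs one can screen 7 parameters; the orthogonality and balance of the columns are what make the per-parameter effect estimates separable.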
DOI: 10.1109/HPCA.2003.1183546
Citations: 171
Just say no: benefits of early cache miss determination
G. Memik, Glenn D. Reinman, W. Mangione-Smith
As the performance gap between the processor cores and the memory subsystem increases, designers are forced to develop new latency hiding techniques. Arguably, the most common technique is to utilize multi-level caches. Each new generation of processors is equipped with higher levels of memory hierarchy with increasing sizes at each level. In this paper, we propose 5 different techniques that will reduce the data access times and power consumption in processors with multi-level caches. Using the information about the blocks placed into and replaced from the caches, the techniques quickly determine whether an access at any cache level will be a miss. The accesses that are identified to miss are aborted. The structures used to recognize misses are much smaller than the cache structures. Consequently the data access times and power consumption are reduced. Using the SimpleScalar simulator, we study the performance of these techniques for a processor with 5 cache levels. The best technique is able to abort 53.1% of the misses on average in SPEC2000 applications. Using these techniques, the execution time of the applications is reduced by up to 12.4% (5.4% on average), and the power consumption of the caches is reduced by as much as 11.6% (3.8% on average).
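The structures the paper uses are much smaller than the caches they shadow; one simple way to get a sound "definitely a miss" answer is a counting filter over hashed block addresses, where a zero count proves absence and a nonzero count is inconclusive. This is a hypothetical illustration of the principle, not one of the paper's five techniques:

```python
class MissDetector:
    """Counting-filter sketch: a small counter array indexed by a hash
    of the block address, updated on cache fills and evictions.
    count == 0  => the block is definitely not cached, so the access
                   can be declared a miss early and aborted.
    count > 0   => inconclusive (hash collisions), fall back to the
                   normal cache lookup."""

    def __init__(self, size=1024):
        self.size = size
        self.counts = [0] * size

    def _idx(self, addr):
        # Knuth-style multiplicative hash into the counter array.
        return (addr * 2654435761) % self.size

    def on_fill(self, addr):
        self.counts[self._idx(addr)] += 1

    def on_evict(self, addr):
        self.counts[self._idx(addr)] -= 1

    def definitely_miss(self, addr):
        return self.counts[self._idx(addr)] == 0
```

Because counts are only incremented on fills and decremented on the matching evictions, a zero count can never be wrong; only the nonzero (inconclusive) case pays the full lookup.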
DOI: 10.1109/HPCA.2003.1183548
Citations: 54
Dynamic data dependence tracking and its application to branch prediction
Lei Chen, S. Dropsho, D. Albonesi
To continue to improve processor performance, microarchitects seek to increase the effective instruction level parallelism (ILP) that can be exploited in applications. A fundamental limit to improving ILP is data dependences among instructions. If data dependence information is available at run-time, there are many ways to use it to improve ILP. Prior published examples include decoupled branch execution architectures and critical instruction detection. In this paper, we describe an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline. This information is available on a cycle-by-cycle basis to the microengine for optimizing its performance. We then use this design in a new value-based branch prediction design using available register value information (ARVI). From the use of data dependence information, the ARVI branch predictor has better prediction accuracy over a comparably sized hybrid branch predictor. With ARVI used as the second-level branch predictor, the improved prediction accuracy results in a 12.6% performance improvement on average across the SPEC95 integer benchmark suite.
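The flavor of value-based prediction can be sketched with a table of saturating counters indexed by the branch PC hashed together with its available source-register values. This toy predictor is an assumption-laden stand-in to show the indexing idea, not the paper's ARVI design (which depends on the dependence-tracking hardware to know which register values are available):

```python
class ValueBasedPredictor:
    """Toy value-based branch predictor: 2-bit saturating counters
    indexed by hash(PC, available operand values). Branches whose
    outcome depends on operand values get separate counter entries
    per value pattern, unlike a PC-only indexed predictor."""

    def __init__(self, bits=12):
        self.size = 1 << bits
        self.table = [1] * self.size   # start weakly not-taken

    def _idx(self, pc, operand_values):
        x = pc
        for v in operand_values:       # fold operand values into the index
            x = (x * 31 + v) & 0xFFFFFFFF
        return x % self.size

    def predict(self, pc, operand_values):
        return self.table[self._idx(pc, operand_values)] >= 2  # taken?

    def update(self, pc, operand_values, taken):
        i = self._idx(pc, operand_values)
        self.table[i] = (min(3, self.table[i] + 1) if taken
                         else max(0, self.table[i] - 1))
```

A data-dependent branch (for example, one that is taken exactly when its operand is nonzero) is unpredictable for a PC-indexed counter but trivially learnable here, since each operand value maps to its own counter.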
DOI: 10.1109/HPCA.2003.1183525
Citations: 33
Performance enhancement techniques for InfiniBand™ Architecture
Eun Jung Kim, K. H. Yum, C. Das, Mazin S. Yousif, J. Duato
The InfiniBand™ Architecture (IBA) is envisioned to be the default communication fabric for future system area networks (SANs). However, the released IBA specification outlines only higher level functionalities, leaving it open for exploring various design alternatives. In this paper we investigate four co-related techniques to provide high and predictable performance in IBA. These are: (i) using the shortest path first (SPF) algorithm for deterministic packet routing; (ii) developing a multipath routing mechanism for minimizing congestion; (iii) developing a selective packet dropping scheme to handle deadlock and congestion; and (iv) providing multicasting support for customized applications. These designs are evaluated using an integrated workload on a versatile IBA simulation testbed. Simulation results indicate that the SPF routing, multipath routing, packet dropping, and multicasting schemes are quite effective in delivering high and assured performance in clusters. One of the major contributions of this research is the IBA simulation testbed, which is an essential tool to evaluate various design tradeoffs.
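The SPF component is standard Dijkstra shortest-path computation, run centrally (in IBA, by the subnet manager) to install deterministic forwarding tables. A generic sketch over a weighted adjacency map; the `first_hop` output is a hypothetical stand-in for a forwarding-table entry and is not IBA-specific:

```python
import heapq

def spf_routes(adj, src):
    """Dijkstra's shortest-path-first over a weighted graph given as
    adj[node] -> {neighbor: link_cost}. Returns (dist, first_hop):
    shortest distance to each node, and the neighbor of `src` on
    that shortest path (the deterministic forwarding decision)."""
    dist = {src: 0}
    first_hop = {src: None}
    pq = [(0, src, None)]          # (distance, node, first hop used)
    while pq:
        d, node, hop = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue               # stale queue entry
        for nbr, w in adj[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # Neighbors of src start their own path; everyone
                # else inherits the first hop of the path so far.
                first_hop[nbr] = nbr if hop is None else hop
                heapq.heappush(pq, (nd, nbr, first_hop[nbr]))
    return dist, first_hop
```

Because the computation is deterministic for a given topology and cost function, every switch can be loaded with consistent, loop-free forwarding state.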
DOI: 10.1109/HPCA.2003.1183543
Citations: 26
Dynamic voltage scaling with links for power optimization of interconnection networks
L. Shang, L. Peh, N. Jha
Originally developed to connect processors and memories in multicomputers, prior research and design of interconnection networks have focused largely on performance. As these networks get deployed in a wide range of new applications, where power is becoming a key design constraint, we need to seriously consider power efficiency in designing interconnection networks. As the demand for network bandwidth increases, communication links, already a significant consumer of power now, will take up an ever larger portion of total system power budget. In this paper we motivate the use of dynamic voltage scaling (DVS) for links, where the frequency and voltage of links are dynamically adjusted to minimize power consumption. We propose a history-based DVS policy that judiciously adjusts link frequencies and voltages based on past utilization. Our approach realizes up to 6.3× power savings (4.6× on average). This is accompanied by a moderate impact on performance (15.2% increase in average latency before network saturation and 2.5% reduction in throughput). To the best of our knowledge, this is the first study that targets dynamic power optimization of interconnection networks.
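A history-based policy of this kind can be sketched as a per-link controller that, at the end of each sampling interval, compares the observed utilization with thresholds and steps through a table of (frequency, voltage) operating points. The levels and thresholds below are illustrative placeholders, not the paper's tuned values:

```python
class HistoryDVSLink:
    """History-based DVS sketch for one link. Each interval, compute
    utilization relative to the current link frequency, then step the
    operating point up on high utilization or down on low utilization.
    LEVELS holds hypothetical (relative frequency, voltage) pairs,
    ordered slowest to fastest."""

    LEVELS = [(0.25, 0.8), (0.5, 0.9), (0.75, 1.0), (1.0, 1.1)]

    def __init__(self, up=0.7, down=0.3):
        self.level = len(self.LEVELS) - 1   # start at full speed
        self.up, self.down = up, down       # utilization thresholds

    def end_interval(self, flits_sent, interval_cycles):
        freq, _ = self.LEVELS[self.level]
        util = flits_sent / (interval_cycles * freq)
        if util > self.up and self.level < len(self.LEVELS) - 1:
            self.level += 1                 # link busy: speed up
        elif util < self.down and self.level > 0:
            self.level -= 1                 # link idle: slow down
        return self.LEVELS[self.level]
```

Since dynamic power scales roughly with f·V², even one step down the table saves substantially more power than the bandwidth it gives up, which is the tradeoff the paper quantifies.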
DOI: 10.1109/HPCA.2003.1183527
Citations: 490
Exploring the VLSI scalability of stream processors
Brucek Khailany, W. Dally, S. Rixner, U. Kapasi, John Douglas Owens, Brian Towles
Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalability of stream architectures to future VLSI technologies where over a thousand floating-point units on a single chip will be feasible. Two techniques for increasing the number of ALUs in a stream processor are presented: intracluster and intercluster scaling. These scaling techniques are shown to be cost-efficient to tens of ALUs per cluster and to hundreds of arithmetic clusters. A 640-ALU stream processor with 128 clusters and 5 ALUs per cluster is shown to be feasible in 45 nanometer technology, sustaining over 300 GOPS on kernels and providing 15.3× kernel speedup and 8.0× application speedup over a 40-ALU stream processor, with a 2% degradation in area per ALU and a 7% degradation in energy dissipated per ALU operation.
DOI: 10.1109/HPCA.2003.1183534
Citations: 74
Deterministic clock gating for microprocessor power reduction
Hai Helen Li, S. Bhunia, Yiran Chen, T. N. Vijaykumar, K. Roy
With the scaling of technology and the need for higher performance and more functionality, power dissipation is becoming a major bottleneck for microprocessor designs. Pipeline balancing (PLB), a previous technique, is essentially a methodology to clock-gate unused components whenever a program's instruction-level parallelism is predicted to be low. However, no nonpredictive methodologies are available in the literature for efficient clock gating. This paper introduces deterministic clock gating (DCG) based on the key observation that for many of the stages in a modern pipeline, a circuit block's usage in a specific cycle in the near future is deterministically known a few cycles ahead of time. Our experiments show an average of 19.9% reduction in processor power with virtually no performance loss for an 8-issue, out-of-order superscalar processor by applying DCG to execution units, pipeline latches, D-Cache wordline decoders, and result bus drivers. In contrast, PLB achieves 9.9% average power savings at 2.9% performance loss.
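The deterministic part can be sketched as follows: once an instruction is decoded, the unit it will use and the cycle it will use it are known exactly, so per-cycle clock enables can be precomputed with no prediction involved. A toy model with a hypothetical fixed decode-to-execute delay (real pipelines gate latches, decoders, and drivers stage by stage, as the abstract lists):

```python
def gating_schedule(decoded, exec_delay, horizon):
    """decoded: list of (decode_cycle, unit_name) pairs. Each decoded
    instruction deterministically reaches its execution unit exactly
    exec_delay cycles after decode. Returns a dict mapping each cycle
    in [0, horizon) to the set of units that need a clock that cycle;
    every other unit-cycle can be clock-gated with zero risk of
    gating a unit that turns out to be needed."""
    active = {cycle: set() for cycle in range(horizon)}
    for decode_cycle, unit in decoded:
        use_cycle = decode_cycle + exec_delay
        if use_cycle < horizon:
            active[use_cycle].add(unit)
    return active
```

The contrast with predictive schemes such as pipeline balancing is that a wrong prediction there either wastes power or stalls the pipeline, whereas here the enable signal is derived from information the pipeline already has.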
DOI: 10.1109/HPCA.2003.1183529
Citations: 110
Hierarchical backoff locks for nonuniform communication architectures
Z. Radovic, Erik Hagersten
This paper identifies node affinity as an important property for scalable general-purpose locks. Nonuniform communication architectures (NUCA), for example CC-NUMA built from a few large nodes or from chip multiprocessors (CMP), have a lower penalty for reading data from a neighbor's cache than from a remote cache. Lock implementations that encourage handing over locks to neighbors will improve the lock handover time, as well as the access to the critical data guarded by the lock, but will also be vulnerable to starvation. We propose a set of simple software-based hierarchical backoff locks (HBO) that create node affinity in NUCA. A solution for lowering the risk of starvation is also suggested. The HBO locks are compared with other software-based lock implementations using simple benchmarks, and are shown to be very competitive for uncontested locks while being more than twice as fast for contended locks. An application study also demonstrates superior performance for applications with high lock contention and competitive performance for other programs.
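The core mechanism can be sketched as a spinlock whose lock word records the holder's node id: a spinner that loses the compare-and-swap backs off briefly if the holder is on its own node and much longer if the holder is remote, so node-local threads statistically win the handover. The sketch below is a single-process Python illustration under stated assumptions (the `HBOLock` class, the backoff constants, and the lock-emulated CAS are all invented here; real HBO locks use hardware atomics):

```python
# Illustrative sketch of a hierarchical backoff (HBO) lock. The lock word
# holds the current owner's node id (or FREE). Failed acquirers back off
# proportionally: short if the owner shares their node, long if remote,
# biasing handover toward neighbors with cheap cache-to-cache transfers.

import random
import threading
import time

FREE = -1
SHORT_BACKOFF = 0.0001   # same-node spin interval (assumed constant)
LONG_BACKOFF = 0.002     # remote-node spin interval (assumed constant)

class HBOLock:
    def __init__(self):
        self._word = FREE
        self._cas_guard = threading.Lock()   # emulates an atomic CAS

    def _cas(self, expected, new):
        """Emulated compare-and-swap on the lock word."""
        with self._cas_guard:
            if self._word == expected:
                self._word = new
                return True
            return False

    def acquire(self, my_node):
        while not self._cas(FREE, my_node):
            holder = self._word
            if holder == FREE:
                continue               # lock just freed: retry immediately
            # Hierarchical backoff: neighbors of the holder retry sooner.
            if holder == my_node:
                time.sleep(SHORT_BACKOFF * random.random())
            else:
                time.sleep(LONG_BACKOFF * random.random())

    def release(self):
        with self._cas_guard:
            self._word = FREE
```

A production version would cap and randomize the backoff windows and, as the paper notes, needs an anti-starvation mechanism, since remote threads can otherwise be locked out indefinitely by a stream of node-local acquirers.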
{"title":"Hierarchical backoff locks for nonuniform communication architectures","authors":"Z. Radovic, Erik Hagersten","doi":"10.1109/HPCA.2003.1183542","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183542","url":null,"abstract":"This paper identifies node affinity as an important property for scalable general-purpose locks. Nonuniform communication architectures (NUCA), for example CC-NUMA built from a few large nodes or from chip multiprocessors (CMP), have a lower penalty for reading data from a neighbor's cache than from a remote cache. Lock implementations that encourages handing over locks to neighbors will improve the lock handover time, as well as the access to the critical data guarded by the lock, but will also be vulnerable to starvation. We propose a set of simple software-based hierarchical backoff locks (HBO) that create node affinity in NUCA. A solution for lowering the risk of starvation is also suggested. The HBO locks are compared with other software-based lock implementations using simple benchmarks, and are shown to be very competitive for uncontested locks while being more than twice as fast for contended locks. An application study also demonstrates superior performance for applications with high lock contention and competitive performance for other programs.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116609803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 73