
The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Latest publications

Incorporating predicate information into branch predictors
B. Simon, B. Calder, J. Ferrante
Predicated execution can be used to alleviate the costs associated with frequently mispredicted branches. This is accomplished by trading the cost of a mispredicted branch for execution of both paths following the conditional branch. In this paper we examine two enhancements for branch prediction in the presence of predicated code. Both of the techniques use recently calculated predicate definitions to provide a more intelligent branch prediction. The first branch predictor, called the squash false path filter, recognizes fetched branches known to be guarded with a false predicate and predicts them as not-taken with 100% accuracy. The second technique, called the predicate global update branch predictor, improves prediction by incorporating recent predicate information into the branch predictor. We use these techniques to aid the prediction of region-based branches. A region-based branch is a branch that is left in a predicated region of code. A region-based branch may be correlated with predicate definitions in the region in addition to those that define the branch's guarding predicate.
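To make the first mechanism concrete, here is a minimal sketch in Python (illustrative only; the class names, the predicate-tracking scheme, and the two-bit baseline are our own stand-ins, not the paper's hardware design): a filter in front of a base predictor that predicts not-taken whenever the fetched branch's guarding predicate is already known to be false.

```python
# Sketch of a squash false path filter wrapped around a base predictor.
# Branches guarded by a predicate already resolved to false will be
# squashed, so they can be predicted not-taken with certainty.

class SquashFalsePathFilter:
    def __init__(self, base_predictor):
        self.base = base_predictor
        self.known_predicates = {}  # predicate register -> resolved value

    def resolve_predicate(self, preg, value):
        """Record a recently calculated predicate definition."""
        self.known_predicates[preg] = value

    def predict(self, pc, guard_preg=None):
        # Fetched branch known to be guarded by a false predicate:
        # predict not-taken, correct by construction.
        if guard_preg is not None and self.known_predicates.get(guard_preg) is False:
            return False
        return self.base.predict(pc)  # otherwise fall back to the base

class TwoBitCounterPredictor:
    """Tiny baseline: per-PC 2-bit saturating counters."""
    def __init__(self):
        self.counters = {}

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2  # taken if counter is 2 or 3

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

# A region-based branch at pc=0x40 guarded by p3, which resolved false:
filt = SquashFalsePathFilter(TwoBitCounterPredictor())
filt.resolve_predicate("p3", False)
print(filt.predict(0x40, guard_preg="p3"))  # False (not-taken), guaranteed
```

The second mechanism, the predicate global update predictor, would additionally shift recent predicate outcomes into the global history consulted by the base predictor; that part is omitted from the sketch.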
Citations: 16
Scalar operand networks: on-chip interconnect for ILP in partitioned architectures
M. Taylor, Walter Lee, Saman P. Amarasinghe, A. Agarwal
The bypass paths and multiported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALUs. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend towards distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects rather than centralized networks. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks such as scalability and deadlock avoidance, they have many unique requirements, including ultra-low latencies (a few cycles versus tens of cycles) and ultra-fast operation-operand matching. This paper discusses the unique properties of scalar operand networks, examines alternative ways of implementing them, and describes in detail the implementation of one such network in the Raw microprocessor. The paper analyzes the performance of these networks for ILP workloads and the sensitivity of overall ILP performance to network properties.
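As a rough illustration of why latency dominates this design space, the toy model below (our own construction with invented parameters, not the paper's cost model) breaks operand delivery into send occupancy, per-hop latency, and receive occupancy:

```python
# Toy end-to-end latency model for delivering one scalar operand from a
# producer ALU to a consumer ALU. All parameters are invented for the
# illustration; they are not measurements of the Raw prototype.

def operand_latency(hops, send_occupancy=1, per_hop=1, receive_occupancy=1):
    """Cycles from 'result ready' at the producer to 'operand usable'
    at the consumer, over a point-to-point scalar operand network."""
    return send_occupancy + hops * per_hop + receive_occupancy

# A centralized bypass network is effectively zero hops, but its wires
# and ports stop scaling as the number of ALUs grows.
print(operand_latency(hops=0))  # 2 cycles in this toy model
# A 4x4 tiled design with dimension-ordered routing: up to 6 hops.
print(operand_latency(hops=6))  # 8 cycles
```

Keeping every term of this sum at a cycle or two is what separates scalar operand networks from conventional multiprocessor networks, whose per-message costs run to tens of cycles.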
Citations: 164
Billion transistor chips in mainstream enterprise platforms of the future
D. Bhandarkar
Today's leading-edge microprocessors, such as Intel's Itanium® 2 processor, feature over 220 million transistors in 0.18µm semiconductor process technology. The nanotechnology that continues to drive Moore's Law doubles transistor density every two years. This indicates that a billion-transistor chip is possible in 65 nm technology within the next 3 to 4 years. Such chips can be used in mainstream enterprise server platforms. This talk will review the progress in semiconductor technology over the three decades since the introduction of the first microprocessor in 1971. A short videotape will provide a historical perspective on Moore's Law in the form of an interview with Intel co-founder Gordon Moore, including his thoughts on the future of semiconductor technology. Key trends in high-end microprocessor design, including multi-threading and multi-core, will be covered. We have started to see "SMP-on-a-chip" designs for high-end enterprise servers, where two processors with Level 2 (L2) cache are incorporated on a single chip. Future microprocessors will offer higher levels of multiprocessor capability on chip as transistor density increases.
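The projection is back-of-the-envelope scaling arithmetic; a quick check using only the abstract's own figures (220 million transistors at 0.18µm, density growing roughly with the inverse square of feature size):

```python
# Back-of-the-envelope check of the projection, using only the figures
# above: transistor density scales roughly with the inverse square of
# the feature size, die area held constant.
base_transistors = 220e6                # Itanium 2 class design
base_node, target_node = 0.18, 0.065    # microns: 180 nm -> 65 nm
density_gain = (base_node / target_node) ** 2
print(round(density_gain, 1))                 # ~7.7x
print(int(base_transistors * density_gain))   # ~1.7 billion transistors
```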
Citations: 2
Inter-cluster communication models for clustered VLIW processors
A. Terechko, Erwan Le Thenaff, Manish Garg, J. V. Eijndhoven, H. Corporaal
Clustering is a well-known technique for improving the implementation of single-register-file VLIW processors. Many previous studies of clustering assume a single means of inter-cluster communication: copy operations. This paper, however, identifies and evaluates five different inter-cluster communication models: copy operations, dedicated issue slots, extended operands, extended results, and broadcasting. Our study reveals that these models have a major impact on the performance and implementation of the clustered VLIW. We found that copy operations executed in regular VLIW issue slots significantly constrain the scheduling freedom of regular operations. For example, in the dense code for our four-cluster machine the total cycle count overhead reached 46.8% with respect to the unicluster architecture, 56% of which is caused by the copy operation constraint. Therefore, we propose to use other models (e.g. extended results or broadcasting), which deliver higher performance than the copy operation model at the same hardware cost.
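A toy comparison of two of the five models (our own illustration with invented latencies, not the paper's machine description) shows where the copy model pays: the explicit copy both lengthens the dependence chain and occupies a regular issue slot that a useful operation could have filled.

```python
# Illustrative comparison of the copy and broadcast models for a value
# produced in cluster 0 and consumed in cluster 1. The latencies are
# invented; the point is that the copy is an extra operation competing
# for a regular issue slot.

def copy_model_cycles(op_latency=1, copy_latency=1):
    # produce (cluster 0) -> explicit copy operation (issued in a
    # regular slot of cluster 0) -> consume (cluster 1)
    return op_latency + copy_latency + op_latency

def broadcast_model_cycles(op_latency=1):
    # produce, with the result written into every cluster's register
    # file -> consume; no extra operation occupies an issue slot
    return op_latency + op_latency

print(copy_model_cycles())       # 3 cycles, plus one issue slot consumed
print(broadcast_model_cycles())  # 2 cycles, no issue slot consumed
```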
Citations: 45
A methodology for designing efficient on-chip interconnects on well-behaved communication patterns
W. Ho, T. Pinkston
As the level of chip integration continues to advance at a fast pace, the desire for efficient interconnects - whether on-chip or off-chip - is rapidly increasing. Traditional interconnects like buses, point-to-point wires and regular topologies may suffer from poor resource sharing in the time and space domains, leading to high contention or low resource utilization. In this paper, we propose a design methodology for constructing networks for special-purpose computer systems with well-behaved (known) communication characteristics. A temporal and spatial model is proposed to define the sufficient condition for contention-free communication. Based upon this model, a design methodology using a recursive bisection technique is applied to systematically partition a parallel system such that the required number of links and switches is minimized while achieving low contention. Results show that the design methodology can generate more optimized on-chip networks with up to 60% fewer resources than meshes or tori while providing blocking performance closer to that of a fully connected crossbar.
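The following sketch shows the flavor of the partitioning step (the exhaustive greedy cut is our stand-in for the authors' actual partitioner): recursively split the nodes so that known traffic across each cut is minimized, with each cut then served by a switch or link of the synthesized topology.

```python
# Sketch of recursive bisection over a known communication demand
# matrix: traffic[i][j] is the volume exchanged between nodes i and j.
# Split the nodes into two halves minimizing traffic across the cut,
# then recurse. Exhaustive search keeps the sketch small.

from itertools import combinations

def cut_cost(traffic, group, nodes):
    other = set(nodes) - set(group)
    return sum(traffic[i][j] for i in group for j in other)

def bisect(traffic, nodes):
    if len(nodes) <= 1:
        return list(nodes)
    half = len(nodes) // 2
    best = min(combinations(nodes, half),
               key=lambda group: cut_cost(traffic, group, nodes))
    rest = [n for n in nodes if n not in best]
    return [bisect(traffic, list(best)), bisect(traffic, rest)]

# Four nodes where 0<->1 and 2<->3 communicate heavily: the bisection
# keeps each chatty pair on the same side of the top-level cut.
traffic = [[0, 9, 1, 0],
           [9, 0, 0, 1],
           [1, 0, 0, 9],
           [0, 1, 9, 0]]
print(bisect(traffic, [0, 1, 2, 3]))  # [[[0], [1]], [[2], [3]]]
```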
Citations: 110
TCP: tag correlating prefetchers
Zhigang Hu, M. Martonosi, S. Kaxiras
Although caches have for decades been the backbone of the memory system, the speed gap between CPU and main memory suggests their augmentation with prefetching mechanisms. Recently, sophisticated hardware correlating prefetching mechanisms have been proposed, in some cases coupled with some form of dead-block prediction. In many proposals, however, correlating prefetchers demand a significant investment in hardware. In this paper we show that correlating prefetchers that work with tags instead of cache-line addresses are significantly more resource-efficient, providing equal or better performance than previous proposals. We support this claim by showing that per-set tag sequences exhibit highly repetitive patterns both within a set and across different sets. Because a single tag sequence can capture multiple address sequences spread over different cache sets, significant space savings can be achieved. We propose a tag-based prefetcher called a tag correlating prefetcher (TCP). Even with very small history tables, TCP outperforms address-based correlating prefetchers many times its size. In addition, we show that such a prefetcher can yield most of its performance benefits if placed at the L2 level of an aggressive out-of-order processor. Only if one wants prefetching all the way up to the L1 is dead-block prediction required. Finally, we draw parallels between the two-level structure of TCP and similar structures for branch prediction mechanisms; these parallels raise interesting opportunities for improving correlating memory prefetchers by harnessing lessons already learned for correlating branch predictors.
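A minimal sketch of the two-level idea (names, history length, and table organization are our simplifications): per-set histories of recent miss tags index one shared correlation table, so a tag pattern learned in one cache set predicts in every set.

```python
# Sketch of a tag correlating prefetcher (TCP). Per-set histories of
# recent miss tags index one shared correlation table, which is why a
# single learned tag sequence covers address streams in many sets.

class TagCorrelatingPrefetcher:
    def __init__(self, history_len=2):
        self.n = history_len
        self.history = {}  # cache set index -> tuple of recent miss tags
        self.table = {}    # tag-sequence tuple -> predicted next tag

    def on_miss(self, set_index, tag):
        hist = self.history.get(set_index, ())
        if len(hist) == self.n:
            self.table[hist] = tag            # learn: this pattern led here
        hist = (hist + (tag,))[-self.n:]
        self.history[set_index] = hist
        if hist in self.table:                # predict the next tag
            return (self.table[hist], set_index)
        return None

tcp = TagCorrelatingPrefetcher()
for s in (0, 1):                  # the same tag sequence recurs in two sets
    for t in ("A", "B", "C"):
        tcp.on_miss(s, t)
# A third set reusing the pattern benefits immediately:
print(tcp.on_miss(2, "A"), tcp.on_miss(2, "B"))  # None ('C', 2)
```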
Citations: 97
Runahead execution: an alternative to very large instruction windows for out-of-order processors
O. Mutlu, J. Stark, C. Wilkerson, Y. Patt
Today's high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large in terms of both design complexity and power consumption. And the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window when it is blocked by a long latency operation, allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel® Pentium® processor with a 128-entry instruction window, adding runahead execution improves IPC (instructions per cycle) by 22% across a wide range of memory-intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.
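The control flow can be caricatured as follows (a toy Python model; real runahead checkpoints register state and executes instructions speculatively rather than scanning the static instruction list as done here): a blocking miss triggers a run ahead that discovers and prefetches the addresses of later loads, so they hit when the processor reaches them.

```python
# Toy model of runahead's effect. A real pipeline would checkpoint
# state at the blocking miss and execute speculatively; here a simple
# scan of the upcoming instructions stands in for that speculative run,
# discovering the loads whose lines runahead would prefetch.

def execute(program, cache, miss_latency=10):
    cycle, pc = 0, 0
    while pc < len(program):
        op, addr = program[pc]
        if op == "load" and addr not in cache:
            # Blocking miss: "run ahead" to find future load addresses.
            future = {a for (o, a) in program[pc + 1:] if o == "load"}
            cycle += miss_latency   # wait out the blocking miss
            cache.add(addr)
            cache |= future         # runahead warmed these lines
        cycle += 1
        pc += 1
    return cycle

prog = [("load", 0xA0), ("add", None), ("load", 0xB0), ("load", 0xC0)]
print(execute(prog, cache=set()))               # 14: a single miss stall
print(execute(prog, cache={0xA0, 0xB0, 0xC0}))  # 4: all-hit baseline
```

Without the runahead prefetches, the same stream would stall on all three misses (34 cycles in this toy model) rather than one.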
Citations: 452
Slipstream execution mode for CMP-based multiprocessors
K. Ibrahim, G. Byrd, E. Rotenberg
Scalability of applications on distributed shared-memory (DSM) multiprocessors is limited by communication overheads. At some point, using more processors to increase parallelism yields diminishing returns or even degrades performance. When increasing concurrency is futile, we propose an additional mode of execution, called slipstream mode, that instead enlists extra processors to assist parallel tasks by reducing perceived overheads. We consider DSM multiprocessors built from dual-processor chip multiprocessor (CMP) nodes with a shared L2 cache. A task is allocated on one processor of each CMP node. The other processor of each node executes a reduced version of the same task. The reduced version skips shared-memory stores and synchronization, running ahead of the true task. Even with the skipped operations, the reduced task makes accurate forward progress and generates an accurate reference stream, because branches and addresses depend primarily on private data. Slipstream execution mode yields two benefits. First, the reduced task prefetches data on behalf of the true task. Second, reduced tasks provide a detailed picture of future reference behavior, enabling a number of optimizations aimed at accelerating coherence events, e.g., self-invalidation. For multiprocessor systems with up to 16 CMP nodes, slipstream mode outperforms running one or two conventional tasks per CMP in 7 out of 9 parallel scientific benchmarks. Slipstream mode is 12-19% faster with prefetching only and up to 29% faster with self-invalidation enabled.
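A sketch of the division of labor on one node (illustrative Python threads; forward-progress checks, recovery, and the coherence optimizations are all omitted): the reduced task executes the same loop but skips shared-memory stores and synchronization, so it runs ahead and touches the data the true task will need.

```python
# Sketch of a slipstream pair on one dual-processor CMP node. The
# reduced task runs the same loop but skips shared-memory stores and
# synchronization, so it races ahead and touches the data the true
# task will need. Forward-progress checks and coherence are omitted.

import threading

shared = [0] * 8
touched_ahead = []  # stand-in for cache lines pulled into the shared L2

def task(reduced):
    for i in range(len(shared)):
        value = shared[i] + 1           # the load: both tasks issue it
        if reduced:
            touched_ahead.append(i)     # side effect: warms the cache
            continue                    # skip the store (and any barrier)
        shared[i] = value               # only the true task commits stores

reduced_task = threading.Thread(target=task, args=(True,))
true_task = threading.Thread(target=task, args=(False,))
reduced_task.start(); true_task.start()
reduced_task.join(); true_task.join()
print(shared, touched_ahead)
```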
Citations: 30
Active I/O switches in system area networks
M. Hao, Mark A. Heinrich
We present an active switch architecture to improve the performance of systems connected via system area networks. Our programmable active switches not only flexibly route packets between any combination of hosts and I/O devices, but also have the capability of running application-level code, forming a parallel processor in the SAN subsystem. By replacing existing SAN-based switches with this new active switch architecture, we can design a prototype system from otherwise commercially available, commodity parts that dramatically speeds up data-intensive applications and workloads on modern multi-programmed servers. We explain the programming model, detail the microarchitecture of our active switch, and analyze simulation results for nine benchmark applications that highlight various advantages of active switch-based systems.
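A toy model of the key capability (the API is invented for illustration; the paper's switch microarchitecture is far richer): besides routing packets between its ports, the switch runs application-level code on payloads in flight, here an in-network reduction.

```python
# Toy active switch: besides routing packets between its ports, it runs
# an application-level handler on payloads in flight. The handler below
# performs an in-network reduction (a running sum). The entire API is
# invented for illustration.

class ActiveSwitch:
    def __init__(self, handler=None):
        self.routes = {}        # destination id -> output port
        self.ports = {}         # output port -> delivered packets
        self.handler = handler  # optional application-level code

    def add_route(self, dest, port):
        self.routes[dest] = port
        self.ports.setdefault(port, [])

    def receive(self, packet):
        dest, payload = packet
        if self.handler is not None:
            payload = self.handler(payload)  # compute inside the SAN
        self.ports[self.routes[dest]].append((dest, payload))

running_total = 0

def accumulate(value):
    global running_total
    running_total += value
    return running_total

switch = ActiveSwitch(handler=accumulate)
switch.add_route("host0", port=1)
for v in (3, 4, 5):
    switch.receive(("host0", v))
print(switch.ports[1][-1])  # ('host0', 12): the reduction ran in-network
```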
Citations: 4
Reconsidering complex branch predictors
Daniel A. Jiménez
To sustain instruction throughput rates in more aggressively clocked microarchitectures, microarchitects have incorporated larger and more complex branch predictors into their designs, taking advantage of the increasing numbers of transistors available on a chip. Unfortunately, because of penalties associated with their implementations, the extra accuracy provided by many branch predictors does not produce a proportionate increase in performance. Specifically, we show that the techniques used to hide the latency of a large and complex branch predictor do not scale well and will be unable to sustain IPC for deeper pipelines. We investigate a different way to build large branch predictors. We propose an alternative predictor design that completely hides predictor latency so that accuracy and hardware budget are the only factors that affect the efficiency of the predictor. Our simple design allows the predictor to be pipelined efficiently by avoiding difficulties introduced by complex predictors. Because this predictor eliminates the penalties associated with complex predictors, overall performance exceeds that of even the most accurate known branch predictors in the literature at large hardware budgets. We conclude that as chip densities increase in the next several years, the accuracy of complex branch predictors must be weighed against the performance benefits of simple branch predictors.
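One general way to hide predictor latency, sketched below (this illustrates ahead pipelining in the abstract; it is not necessarily the paper's exact design), is to start the slow table lookup several cycles early with older history and use the late-arriving history bits only as a final select among prefetched candidate counters.

```python
# Sketch of hiding predictor latency by ahead pipelining: the slow table
# lookup starts `ahead` cycles early using the history available then,
# and the history bits that arrive during the lookup act only as a late
# select among prefetched candidate counters. This shows the general
# idea, not necessarily the paper's exact design.

class AheadPipelinedPredictor:
    def __init__(self, ahead=2, index_bits=12):
        self.ahead = ahead
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # 2-bit counters, weakly not-taken

    def start_lookup(self, pc, old_history):
        # Begun `ahead` cycles early: fetch all 2**ahead counters the
        # still-unknown newest history bits could select.
        base = ((pc ^ old_history) << self.ahead) & self.mask
        return [self.table[base | i] for i in range(1 << self.ahead)]

    def finish(self, candidates, recent_bits):
        # The newest `ahead` history bits arrived while the lookup ran.
        return candidates[recent_bits] >= 2

predictor = AheadPipelinedPredictor()
candidates = predictor.start_lookup(pc=0x44, old_history=0b1011)
print(predictor.finish(candidates, recent_bits=0b01))  # ready, no bubble
```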
Citations: 75