Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI:10.1109/HPCA.2002.995709

P. Wang, Hong Wang, Jamison D. Collins, Edward T. Grochowski, R. Kling, John Paul Shen

{"title":"Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation","authors":"P. Wang, Hong Wang, Jamison D. Collins, Edward T. Grochowski, R. Kling, John Paul Shen","doi":"10.1109/HPCA.2002.995709","DOIUrl":null,"url":null,"abstract":"The performance of in-order execution Itanium/sup TM/ processors can suffer significantly due to cache misses. Two memory latency tolerance approaches can be applied for the Itanium processors. One uses an out-of-order (OOO) execution core; the other assumes multithreading support and exploits cache prefetching via speculative precomputation (SP). This paper evaluates and contrasts these two approaches. In addition, this paper assesses the effectiveness of combining the two approaches. For a select set of memory-intensive programs, an in-order SMT Itanium processor using speculative precomputation can achieve performance improvement (92%) comparable to that of an out-of-order design (87%). Applying both 000 and SP yields a total performance improvement of 141% over the baseline in-order machine. OOO tends to be effective in prefetching-for L1 misses; whereas SP is primarily good at covering L2 and L3 misses. Our analysis indicates that the two approaches can be redundant or complementary depending on the type of delinquent loads that each targets. Both approaches are effective on delinquent loads in the loop body; however only SP is effective on delinquent loads found in loop control code.","PeriodicalId":408620,"journal":{"name":"Proceedings Eighth International Symposium on High Performance Computer Architecture","volume":"75 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings Eighth International Symposium on High Performance Computer Architecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2002.995709","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 39

Abstract

The performance of in-order execution Itanium/sup TM/ processors can suffer significantly due to cache misses. Two memory latency tolerance approaches can be applied for the Itanium processors. One uses an out-of-order (OOO) execution core; the other assumes multithreading support and exploits cache prefetching via speculative precomputation (SP). This paper evaluates and contrasts these two approaches. In addition, this paper assesses the effectiveness of combining the two approaches. For a select set of memory-intensive programs, an in-order SMT Itanium processor using speculative precomputation can achieve performance improvement (92%) comparable to that of an out-of-order design (87%). Applying both 000 and SP yields a total performance improvement of 141% over the baseline in-order machine. OOO tends to be effective in prefetching-for L1 misses; whereas SP is primarily good at covering L2 and L3 misses. Our analysis indicates that the two approaches can be redundant or complementary depending on the type of delinquent loads that each targets. Both approaches are effective on delinquent loads in the loop body; however only SP is effective on delinquent loads found in loop control code.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Itanium处理器的内存延迟容忍方法:乱序执行与推测性预计算

由于缓存丢失，按顺序执行的Itanium/sup TM/处理器的性能可能会受到严重影响。两种内存延迟容忍方法可以应用于Itanium处理器。一个使用乱序(OOO)执行核心;另一种假设支持多线程，并通过推测预计算(SP)利用缓存预取。本文对这两种方法进行了评价和对比。此外，本文还评估了两种方法相结合的有效性。对于一组选定的内存密集型程序，使用推测预计算的有序SMT Itanium处理器可以实现与无序设计(87%)相当的性能改进(92%)。同时应用000和SP，总性能比基准有序机器提高141%。OOO在预取中往往是有效的——对于L1缺失;而SP主要擅长于补上L2和L3失误。我们的分析表明，这两种方法可以是冗余的，也可以是互补的，这取决于每个目标的拖欠负荷的类型。两种方法都能有效地处理环体内的逾期荷载;然而，只有SP对在循环控制代码中发现的不良负载有效。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings Eighth International Symposium on High Performance Computer Architecture

自引率

0.00%

发文量

期刊最新文献

Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors Tuning garbage collection in an embedded Java environment Power issues related to branch prediction Using internal redundant representations and limited bypass to support pipelined adders and register files Modeling value speculation