Improving load/store queues usage in scientific computing

International Conference on Parallel Processing, 2004 (ICPP 2004). Pub Date: 2004-08-15. DOI: 10.1109/ICPP.2004.1327902

C. Lemuet, W. Jalby, S. Touati


Citation count: 10

Abstract

Memory disambiguation mechanisms, coupled with load/store queues in out-of-order processors, are crucial for increasing instruction-level parallelism (ILP), especially for memory-bound scientific codes. Designing ideal memory disambiguation mechanisms is too complex because it would require precise address-bit comparators; thus, modern microprocessors implement simplified and imprecise ones that perform only partial address comparisons. In this paper, we study the impact of such simplifications on the sustained performance of some real processors such as the Alpha 21264, Power 4 and Itanium 2. Despite all the advanced features of these processors, we demonstrate that memory address disambiguation mechanisms can cause significant performance loss. We demonstrate that, even if the data are located in low cache levels and enough ILP exists, performance can degrade by a factor of up to 21 if no care is taken over the order in which independent memory addresses are accessed. Instead of proposing a hardware solution to improve load/store queues, as done in [G. Chrysos et al. (1998), S. Sethumadhavan et al. (2003), I. Park et al. (2003), A. Yoaz et al. (1999), S. Onder (2002)], we show that a software (compilation) technique is possible. Such a solution is based on the classical (and robust) ld/st vectorization. Our experiments highlight the effectiveness of this method on BLAS 1 codes, which are representative of vector scientific loops.
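The two ideas in the abstract, partial address comparison and ld/st vectorization, can be illustrated with a minimal C sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the helper `partial_alias`, the `low_bits` parameter (the abstract does not give the comparator widths of the processors studied), the function names, and the grouping factor of 4. It reads ld/st vectorization as grouping the loads of a block ahead of its stores, shown here on a BLAS 1 daxpy loop, and it assumes `x` and `y` do not overlap. A compiler is free to reorder these accesses, so the source only conveys the intended access order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an imprecise disambiguator: two addresses are
 * reported as conflicting when their low_bits low-order bits match,
 * even if the full addresses differ (a false dependence). */
static int partial_alias(uintptr_t a, uintptr_t b, unsigned low_bits)
{
    /* Shifting by the full word width would be undefined behavior. */
    assert(low_bits > 0 && low_bits < 8 * sizeof(uintptr_t));
    uintptr_t mask = ((uintptr_t)1 << low_bits) - 1;
    return (a & mask) == (b & mask);
}

/* Baseline daxpy: every iteration interleaves two loads with one store. */
static void daxpy_interleaved(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Grouped variant: per block of 4 elements, issue all loads of x, then
 * all loads of y, then all stores to y. Assumes x and y do not overlap. */
static void daxpy_grouped(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        double y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];
        y[i]     = a * x0 + y0;
        y[i + 1] = a * x1 + y1;
        y[i + 2] = a * x2 + y2;
        y[i + 3] = a * x3 + y3;
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Under this model, two independent arrays whose base addresses differ only above the compared bits would be flagged as conflicting by `partial_alias`, which is the kind of false dependence the abstract attributes to simplified comparators. Both daxpy variants compute the same result; only the order of memory accesses differs.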


