
Latest publications: 10th International Symposium on High Performance Computer Architecture (HPCA'04)

Exploring Wakeup-Free Instruction Scheduling
Jie S. Hu, N. Vijaykrishnan, M. J. Irwin
Design of wakeup-free issue queues is becoming desirable due to the increasing complexity associated with broadcast-based instruction wakeup. The effectiveness of most wakeup-free issue queue designs critically depends on their success in predicting the issue latency of an instruction accurately. Consequently, the goal of this paper is to explore the predictability of instruction issue latency under different design constraints and to identify the impediments to performance in such wakeup-free architectures. Our results indicate that structural problems in promoting instructions to the head of the instruction queue, from where they are issued in wakeup-free architectures, the limited number of candidate instructions that can be considered for issue, and the resource conflicts due to non-availability of issue ports all significantly degrade the performance of broadcast-free architectures. Based on these observations, we explore an architecture that attempts to overcome the structural limitations by employing traditional selection logic, and that uses pre-check logic to reduce the impact of resource conflicts while still employing a wakeup-free strategy based on predicted instruction issue latencies. Finally, we improve this technique by limiting the selection logic to a small segment of the issue queue.
Cited by: 32
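The core mechanism above, issuing on a predicted issue latency instead of a broadcast wakeup, can be sketched as a toy model. This is an illustrative sketch, not the paper's hardware design: the `Insn` class, the single in-order issue port, and the one-issue-per-cycle loop are all assumptions made for the example.

```python
# Toy model of wakeup-free scheduling: issue from the queue head when a
# predicted ready cycle arrives; no broadcast wakeup is performed.
from dataclasses import dataclass, field

@dataclass
class Insn:
    name: str
    latency: int                               # execution latency in cycles
    srcs: list = field(default_factory=list)   # names of producer insns

def schedule(insns):
    """Single in-order issue port: the head instruction issues once the
    current cycle reaches its predicted ready cycle. Returns {name: cycle}."""
    latency = {i.name: i.latency for i in insns}
    issue_cycle, queue, cycle = {}, list(insns), 0
    while queue:
        head = queue[0]
        # predicted ready cycle = latest (producer issue + producer latency)
        ready = max((issue_cycle[s] + latency[s] for s in head.srcs), default=0)
        if cycle >= ready:
            issue_cycle[head.name] = cycle
            queue.pop(0)
        cycle += 1
    return issue_cycle
```

The head-of-queue rule also reproduces the structural problem the paper measures: a not-yet-ready head blocks any ready instructions queued behind it.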
Synthesizing Representative I/O Workloads for TPC-H
Jianyong Zhang, A. Sivasubramaniam, H. Franke, N. Gautam, Yanyong Zhang, S. Nagar
Synthesizing I/O requests that can accurately capture workload behavior is extremely valuable for the design, implementation and optimization of disk subsystems. This paper presents a synthetic workload generator for TPC-H, an important decision-support commercial workload, by completely characterizing the arrival and access patterns of its queries. We present a novel approach for parameterizing the behavior of inter-mingling streams of sequential requests, and exploit correlations between multiple attributes of these requests, to generate disk block-level traces that are shown to accurately mimic the behavior of a real trace in terms of response time characteristics for each TPC-H query.
Cited by: 55
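The arrival/access characterization can be illustrated with a minimal block-level generator. This is a hedged sketch, not the paper's TPC-H model: the exponential inter-arrival times, the fixed sequential-run probability `seq_prob`, and the uniform jump target are simplifying assumptions for the example.

```python
import random

def synth_trace(n, seq_prob=0.8, mean_gap_ms=2.0, span=1_000_000, seed=42):
    """Generate n (arrival_time_ms, block) requests: with probability
    seq_prob the next request continues the current sequential run,
    otherwise a new run starts at a random block."""
    rng = random.Random(seed)
    t, block, trace = 0.0, rng.randrange(span), []
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_gap_ms)  # Poisson-like arrivals
        trace.append((t, block))
        block = block + 1 if rng.random() < seq_prob else rng.randrange(span)
    return trace
```

A real characterization would fit `seq_prob`, run lengths, and inter-arrival distributions per query from a measured trace, which is the part the paper contributes.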
Link-time path-sensitive memory redundancy elimination
Manel Fernández, R. Espasa
Optimizations performed at link-time or directly applied to final program executables have received increased attention in recent years. We discuss the discovery and elimination of redundant memory operations in the context of a link-time optimizer, an optimization that we call memory redundancy elimination (MRE). Previous research showed that existing MRE techniques are mainly based on path-insensitive information, which causes many MRE opportunities to be lost. We present a new technique for eliminating redundant loads in a path-sensitive fashion, by using a novel alias analysis algorithm that is able to expose path-sensitive memory redundancies. We also extend our previous work by removing both redundant and dead stores. Our experiments show that around 75% of load and 10% of store references in a program can be considered redundant, because they are accessing memory locations that have been referenced less than 256 memory instructions away. By combining our previous optimizations for eliminating load redundancies with the new techniques developed, we show that around 18% of the loads and 8% of the stores can be detected and eliminated, which translates into a 10% reduction in execution time.
Cited by: 3
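The 75%-of-loads measurement suggests a simple redundancy detector: a load is redundant if its address was touched within the last 256 memory instructions. The sketch below makes that idea concrete; the trace format and the `window` parameter are assumptions for illustration, not the authors' analysis infrastructure.

```python
from collections import deque

def redundant_loads(trace, window=256):
    """Count loads whose address was referenced by any memory operation
    within the last `window` memory instructions.
    trace: iterable of ("load" | "store", address) pairs."""
    recent = deque(maxlen=window)   # addresses of the most recent memory ops
    redundant = 0
    for op, addr in trace:
        if op == "load" and addr in recent:
            redundant += 1
        recent.append(addr)
    return redundant
```

Actually eliminating such a load at link time additionally requires the paper's alias analysis to prove no intervening store can write the same location along any path.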
Understanding scheduling replay schemes
I. Kim, Mikko H. Lipasti
Modern microprocessors adopt speculative scheduling techniques where instructions are scheduled several clock cycles before they actually execute. Due to this scheduling delay, scheduling misses should be recovered across the multiple levels of dependence chains in order to prevent further unnecessary execution. We explore the design space of various scheduling replay schemes that prevent the propagation of scheduling misses, and find that current and proposed replay schemes do not scale well and require instructions to execute in correct data dependence order, since they track dependences among instructions within the instruction window as a part of the scheduling or execution process. We propose token-based selective replay that moves the dependence information propagation loop out of the scheduler, enabling lower complexity in the scheduling logic and support for data-speculation techniques at the expense of marginal IPC degradation compared to an ideal selective replay scheme.
Cited by: 69
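The token idea can be sketched outside any scheduler model: each load receives a token at dispatch, consumers inherit their producers' tokens, and a latency misprediction replays exactly the token holders. This toy sketch assumes a simple `(name, is_load, srcs)` instruction encoding and ignores issue timing; it only shows how the dependence-propagation loop moves out of the scheduler.

```python
def assign_tokens(insns):
    """insns: (name, is_load, srcs) tuples in dependence order. Each load
    gets its own token; consumers inherit the union of their producers'
    tokens. All of this happens at dispatch, outside the scheduler."""
    tokens = {}
    for name, is_load, srcs in insns:
        t = {name} if is_load else set()
        for s in srcs:
            t |= tokens[s]
        tokens[name] = t
    return tokens

def replay_set(tokens, missed_load):
    """Instructions to selectively replay when missed_load's latency was
    mispredicted (e.g. a cache miss): exactly the holders of its token."""
    return {n for n, t in tokens.items() if missed_load in t}
```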
Stream register files with indexed access
N. Jayasena, M. Erez, Jung Ho Ahn, W. Dally
Many current programmable architectures designed to exploit data parallelism require computation to be structured to operate on sequentially accessed vectors or streams of data. Applications with less regular data access patterns perform sub-optimally on such architectures. We present a register file for streams (SRF) that allows arbitrary, indexed accesses. Compared to sequential SRF access, indexed access captures more temporal locality, reduces data replication in the SRF, and provides efficient support for certain types of complex access patterns. Our simulations show that indexed SRF access provides speedups of 1.03x to 4.1x and memory bandwidth reductions of up to 95% over sequential SRF access for a set of benchmarks representative of data-parallel applications with irregular accesses. Indexed SRF access also provides greater speedups than caches for a number of application classes despite significantly lower hardware costs. The area overhead of our indexed SRF implementation is 11%-22% over a sequentially accessed SRF, which corresponds to a modest 1.5%-3% increase in the total die area of a typical stream processor.
Cited by: 53
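The difference between sequential and indexed SRF access can be shown with a toy model. The `StreamRegisterFile` class below is an illustrative assumption, not the proposed hardware: it only contrasts a per-stream read pointer with arbitrary-offset reads that serve an irregular gather directly, without first materializing a reordered copy of the stream.

```python
class StreamRegisterFile:
    """Toy SRF: sequential access through a per-stream read pointer,
    plus the arbitrary-offset (indexed) access mode the paper adds."""
    def __init__(self, data):
        self.data = list(data)
        self.ptr = 0

    def read_seq(self):
        """Sequential mode: return the next element and advance."""
        v = self.data[self.ptr]
        self.ptr += 1
        return v

    def read_indexed(self, i):
        """Indexed mode: any offset, in any order; pointer untouched."""
        return self.data[i]

def gather(srf, idx):
    # An irregular access pattern served directly from the SRF.
    return [srf.read_indexed(i) for i in idx]
```

In a purely sequential SRF, the same gather would have to be expressed as one or more full-stream passes, which is the data replication the indexed mode avoids.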
Using prime numbers for cache indexing to eliminate conflict misses
Mazen Kharbutli, Keith Irwin, Yan Solihin, Jaejin Lee
Using alternative cache indexing/hashing functions is a popular technique to reduce conflict misses by achieving a more uniform cache access distribution across the sets in the cache. Although various alternative hashing functions have been demonstrated to eliminate the worst-case conflict behavior, no study has really analyzed the pathological behavior of such hashing functions, which often results in performance slowdown. We present an in-depth analysis of the pathological behavior of cache hashing functions. Based on the analysis, we propose two new hashing functions, prime modulo and prime displacement, that are resistant to pathological behavior and yet are able to eliminate the worst-case conflict behavior in the L2 cache. We show that these two schemes can be implemented in fast hardware using a set of narrow add operations, with negligible fragmentation in the L2 cache. We evaluate the schemes on 23 memory-intensive applications. For applications that have nonuniform cache accesses, both prime modulo and prime displacement hashing achieve an average speedup of 1.27 compared to traditional hashing, without slowing down any of the 23 benchmarks. We also evaluate using multiple prime displacement hashing functions in conjunction with a skewed associative L2 cache. The skewed associative cache achieves a better average speedup at the cost of some pathological behavior that slows down four applications by up to 7%.
Cited by: 114
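The prime modulo idea is easy to demonstrate: with a power-of-two set count, an access stride equal to the set count maps every address to a single set, while a nearby prime spreads the same addresses almost uniformly. The sketch below is a stand-alone illustration of the indexing arithmetic, not the paper's hardware implementation.

```python
# Set index = address mod number_of_sets; count accesses per set.
def set_histogram(addrs, num_sets):
    counts = [0] * num_sets
    for a in addrs:
        counts[a % num_sets] += 1
    return counts

# Stride-64 accesses are pathological for a 64-set cache: every address
# lands in set 0. A nearby prime (61) spreads them near-uniformly.
addrs = [i * 64 for i in range(1024)]
pow2_sets = set_histogram(addrs, 64)    # 64 sets, power of two
prime_sets = set_histogram(addrs, 61)   # 61 sets, prime
```

Dividing by a prime is the expensive part in hardware; the paper's contribution is computing such a modulus cheaply with a set of narrow add operations.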
Low-complexity distributed issue queue
J. Abella, Antonio González
As technology evolves, power density significantly increases and cooling systems become more complex and expensive. The issue logic is one of the processor hotspots and, at the same time, its latency is crucial for processor performance. We present a low-complexity FP issue logic (MB_distr) that achieves high performance with small energy requirements. The MB_distr scheme is based on classifying instructions and dispatching them into a set of queues depending on their data dependences. These instructions are selected for issuing based on an estimation of when their operands will be available, so the conventional wakeup activity is not required. Additionally, the functional units are distributed across the different queues. The energy required by the proposed scheme is substantially lower than that required by a conventional issue design, even if the latter has the ability of waking up only unready operands. The MB_distr scheme reduces the energy-delay product by 35% and the energy-delay-squared product by 18% with respect to a state-of-the-art approach.
Cited by: 21
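The classify-and-dispatch step can be sketched as follows. This is an illustrative toy policy, not the authors' exact heuristic: it places each instruction in the queue of its most recent in-flight producer, so dependence chains stay in program order within one queue, and distributes independent instructions round-robin.

```python
def dispatch(insns, num_queues=4):
    """insns: (name, srcs) pairs in program order. Returns the contents
    of each queue after dispatch. Chains share a queue; independent
    instructions are spread round-robin across queues."""
    queues = [[] for _ in range(num_queues)]
    where = {}   # instruction name -> queue index
    rr = 0       # round-robin counter for independent instructions
    for name, srcs in insns:
        producer_queues = [where[s] for s in srcs if s in where]
        if producer_queues:
            q = producer_queues[-1]      # follow the most recent producer
        else:
            q = rr % num_queues
            rr += 1
        queues[q].append(name)
        where[name] = q
    return queues
```

Keeping a chain within one queue is what lets each queue issue based on estimated operand-availability times alone, with no cross-queue wakeup broadcast.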