Memory coherence activity prediction in commercial workloads
Stephen Somogyi, T. Wenisch, N. Hardavellas, Jangwoo Kim, A. Ailamaki, B. Falsafi
Recent research indicates that prediction-based coherence optimizations offer substantial performance improvements for scientific applications running on distributed shared-memory multiprocessors. Important commercial applications also show sensitivity to coherence latency, which will become more acute as technology scales. It is therefore important to investigate the prediction of memory coherence activity in the context of commercial workloads. This paper studies a trace-based Downgrade Predictor (DGP) for predicting last stores to shared cache blocks, and a pattern-based Consumer Set Predictor (CSP) for predicting subsequent readers. We evaluate this class of predictors for the first time on commercial applications and demonstrate that our DGP correctly predicts 47% to 76% of last stores. Memory sharing patterns in commercial workloads are inherently non-repetitive, so CSP cannot attain high coverage. We perform an opportunity study of a DGP enhanced with competitive underlying predictors and demonstrate, for both commercial and scientific applications, the potential to increase coverage by up to 14%.
{"title":"Memory coherence activity prediction in commercial workloads","authors":"Stephen Somogyi, T. Wenisch, N. Hardavellas, Jangwoo Kim, A. Ailamaki, B. Falsafi","doi":"10.1145/1054943.1054949","DOIUrl":"https://doi.org/10.1145/1054943.1054949","url":null,"abstract":"Recent research indicates that prediction-based coherence optimizations offer substantial performance improvements for scientific applications in distributed shared memory multiprocessors. Important commercial applications also show sensitivity to coherence latency, which will become more acute in the future as technology scales. Therefore it is important to investigate prediction of memory coherence activity in the context of commercial workloads.This paper studies a trace-based Downgrade Predictor (DGP) for predicting last stores to shared cache blocks, and a pattern-based Consumer Set Predictor (CSP) for predicting subsequent readers. We evaluate this class of predictors for the first time on commercial applications and demonstrate that our DGP correctly predicts 47%-76% of last stores. Memory sharing patterns in commercial workloads are inherently non-repetitive; hence CSP cannot attain high coverage. We perform an opportunity study of a DGP enhanced through competitive underlying predictors, and in commercial and scientific applications, demonstrate potential to increase coverage up to 14%.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132061515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Micro-architecture techniques in the Intel® E8870 scalable memory controller
F. Briggs, S. Chittor, Kai Cheng
This paper describes several selected micro-architectural tradeoffs and optimizations for the scalable memory controller of the Intel E8870 chipset architecture. The E8870 chipset supports scalable coherent multiprocessor systems of 2 to 16 processors using a point-to-point Scalability Port (SP) protocol. Its scalable memory controller applies a number of micro-architectural techniques to reduce local and remote idle and loaded latencies. These performance optimizations were achieved within the constraints of maintaining functional correctness while reducing implementation complexity and cost. High-bandwidth point-to-point interconnects and distributed memory are expected to become more common in future platforms supporting powerful multi-core processors, and the techniques discussed in this paper will be applicable to the scalable memory controllers those platforms need. The techniques have been proven in production systems for Itanium® 2 processor platforms.
{"title":"Micro-architecture techniques in the intel® E8870 scalable memory controller","authors":"F. Briggs, S. Chittor, Kai Cheng","doi":"10.1145/1054943.1054948","DOIUrl":"https://doi.org/10.1145/1054943.1054948","url":null,"abstract":"This paper describes several selected micro-architectural tradeoffs and optimizations for the scalable memory controller of the Intel E8870 chipset architecture. The Intel E8870 chipset architecture supports scalable coherent multiprocessor systems using 2 to 16 processors, and a point-to-point Scalability Port (SP) Protocol. The scalable memory controller micro-architecture applies a number of micro-architecture techniques to reduce the local & remote idle and loaded latencies. The performance optimizations were achieved within the constraints of maintaining functional correctness, while reducing implementation complexity and cost. High bandwidth point-to-point interconnects and distributed memory are expected to be more common in future platforms to support powerful multi-core processors. The selected techniques discussed in this paper will be applicable to scalable memory controllers needed in those platforms. These techniques have been proven for production systems for the Itanium® II Processor platforms.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132865243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the effectiveness of prefetching and reuse in reducing L1 data cache traffic: a case study of Snort
G. Surendra, Subhasish Banerjee, S. Nandy
Reducing the number of data cache accesses improves performance, port efficiency, and bandwidth, and motivates the use of single-ported caches instead of complex and expensive multi-ported ones. In this paper we take an intrusion detection system as the target application and study the effectiveness of two techniques in reducing data cache traffic: (i) prefetching data from the cache into local buffers in the processor core, and (ii) load Instruction Reuse (IR). The analysis is carried out using a microarchitecture and instruction set representative of a programmable processor, with the aim of determining whether these techniques are viable for the programmable pattern-matching engines found in many network processors. We find that IR is the most general and efficient technique, reducing cache traffic by up to 60%. However, a combination of prefetching and IR with application-specific tuning performs as well as, and sometimes better than, IR alone.
{"title":"On the effectiveness of prefetching and reuse in reducing L1 data cache traffic: a case study of Snort","authors":"G. Surendra, Subhasish Banerjee, S. Nandy","doi":"10.1145/1054943.1054955","DOIUrl":"https://doi.org/10.1145/1054943.1054955","url":null,"abstract":"Reducing the number of data cache accesses improves performance, port efficiency, bandwidth and motivates the use of single ported caches instead of complex and expensive multi-ported ones. In this paper we consider an intrusion detection system as a target application and study the effectiveness of two techniques - (i) prefetching data from the cache into local buffers in the processor core and (ii) load Instruction Reuse (IR) - in reducing data cache traffic. The analysis is carried out using a microarchitecture and instruction set representative of a programmable processor with the aim of determining if the above techniques are viable for a programmable pattern matching engine found in many network processors. We find that IR is the most generic and efficient technique which reduces cache traffic by up to 60%. However, a combination of prefetching and IR with application specific tuning performs as well as and sometimes better than IR alone.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"201202 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116486698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compiler-optimized usage of partitioned memories
L. Wehmeyer, Urs Helmig, P. Marwedel
To meet both the performance and the energy-consumption requirements of embedded systems, new memory architectures are being introduced. Besides the well-known use of caches in the memory hierarchy, processor cores today also include small on-chip memories called scratchpad memories, whose usage is controlled not by hardware but by the programmer or the compiler. Techniques for utilizing these scratchpads have been known for some time. Some newer processors provide more than one scratchpad, making it necessary to enhance the workflow so that this more complex memory architecture can be exploited efficiently. In this work, we present an energy model and an ILP formulation that optimally assigns memory objects to the different scratchpad partitions at compile time, achieving energy savings of up to 22% compared to previous approaches.
{"title":"Compiler-optimized usage of partitioned memories","authors":"L. Wehmeyer, Urs Helmig, P. Marwedel","doi":"10.1145/1054943.1054959","DOIUrl":"https://doi.org/10.1145/1054943.1054959","url":null,"abstract":"In order to meet the requirements concerning both performance and energy consumption in embedded systems, new memory architectures are being introduced. Beside the well-known use of caches in the memory hierarchy, processor cores today also include small onchip memories called scratchpad memories whose usage is not controlled by hardware, but rather by the programmer or the compiler. Techniques for utilization of these scratchpads have been known for some time. Some new processors provide more than one scratchpad, making it necessary to enhance the workflow such that this complex memory architecture can be efficiently utilized. In this work, we present an energy model and an ILP formulation to optimally assign memory objects to different partitions of scratchpad memories at compile time, achieving energy savings of up to 22% compared to previous approaches.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114282443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the effects of wrong-path memory references on processor performance
O. Mutlu, Hyesoon Kim, D. N. Armstrong, Y. Patt
High-performance out-of-order processors spend a significant portion of their execution time on the incorrect program path even though they employ aggressive branch prediction algorithms. Although memory references generated on the wrong path do not change the architectural state of the processor, they can affect the arrangement of data in the memory hierarchy. This paper examines the effects of wrong-path memory references on processor performance. We show that these references significantly affect the IPC (instructions per cycle) of a processor: not modeling them can lead to errors of up to 10% in IPC estimates for the SPEC2000 integer benchmarks, and 7 out of 12 benchmarks show an IPC error greater than 2%. In general, the error in IPC increases with memory latency and instruction window size. We find that wrong-path references are usually beneficial for performance because they prefetch data that will be used by later correct-path references. L2 cache pollution is the most significant negative effect of wrong-path references. Code examples provide insight into how wrong-path references affect performance.
{"title":"Understanding the effects of wrong-path memory references on processor performance","authors":"O. Mutlu, Hyesoon Kim, D. N. Armstrong, Y. Patt","doi":"10.1145/1054943.1054951","DOIUrl":"https://doi.org/10.1145/1054943.1054951","url":null,"abstract":"High-performance out-of-order processors spend a significant portion of their execution time on the incorrect program path even though they employ aggressive branch prediction algorithms. Although memory references generated on the wrong path do not change the architectural state of the processor, they can affect the arrangement of data in the memory hierarchy. This paper examines the effects of wrong-path memory references on processor performance. It is shown that these references significantly affect the IPC (Instructions Per Cycle) performance of a processor. Not modeling them can lead to errors of up to 10% in IPC estimates for the SPEC2000 integer benchmarks; 7 out of 12 benchmarks experience an error of greater than 2% in IPC estimates. In general, the error in the IPC increases with increasing memory latency and instruction window size.We find that wrong-path references are usually beneficial for performance, because they prefetch data that will be used by later correct-path references. L2 cache pollution is found to be the most significant negative effect of wrong-path references. Code examples are shown to provide insights into how wrong-path references affect performance.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131748325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A case for multi-level main memory
M. Ekman, P. Stenström
Current trends suggest that the number of memory chips per processor chip will increase by at least a factor of ten within seven years. This will make the cost of DRAM, and the space and power it consumes, a serious problem. The central question in this research is how cost, size, and power consumption can be reduced by transforming traditional flat main-memory systems into a multi-level hierarchy. We make the case for a multi-level main-memory hierarchy by proposing and evaluating an implementation that enables aggressive use of memory compression, sharing of memory resources among computers, and dynamic power management of unused regions of memory. This paper presents the key design strategies that make this possible. We evaluate our implementation using complete runs of applications from the SPEC2000 suite, SPECjbb, and SAP, which are typical desktop and server applications. We show that only 30% of the memory resources typically needed must be accessed at DRAM speed, whereas the rest can be accessed at a speed an order of magnitude slower. The resulting performance overhead is only 1.2% on average.
{"title":"A case for multi-level main memory","authors":"M. Ekman, P. Stenström","doi":"10.1145/1054943.1054944","DOIUrl":"https://doi.org/10.1145/1054943.1054944","url":null,"abstract":"Current trends suggest that the number of memory chips per processor chip will increase at least a factor of ten in seven years. This will make DRAM cost, the space and the power it consumes a serious problem. The main question raised in this research is how cost, size, and power consumption can be reduced by transforming traditional flat main-memory systems into a multi-level hierarchy. We make the case for a multi-level main memory hierarchy by proposing and evaluating the performance of an implementation that enables aggressive use of memory compression, sharing of memory resources among computers, and dynamic power management of unused regions of memory. This paper presents the key design strategies to make this happen. We evaluate our implementation using complete runs of applications from the Spec 2K suite, SpecJBB, and SAP --- typical desktop and server applications. We show that only 30% of the entire memory resources typically needed must be accessed at DRAM speed whereas the rest can be accessed at a speed that is a magnitude slower. The resulting performance overhead is shown to be only 1.2% on average.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115531761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating kilo-instruction multiprocessors
M. Galluzzi, R. Beivide, Valentin Puente, J. Gregorio, A. Cristal, M. Valero
The ever-increasing gap between processor and memory speeds has a very negative impact on performance. One possible solution is the Kilo-instruction processor, a recently proposed architecture that hides large memory latencies by keeping thousands of instructions in flight. Multiprocessor systems must deal with this increasing memory latency while also facing another source of latency: communication among processors. In this paper, we propose using Kilo-instruction processors as the compute nodes of small-scale CC-NUMA multiprocessors, and we evaluate what we appropriately call Kilo-instruction Multiprocessors. Such systems achieve very good performance and exhibit two interesting behaviours. First, the large number of in-flight instructions lets the system hide not only memory-access latencies but also the communication latencies inherent in remote memory accesses. Second, the pressure imposed by so many in-flight instructions translates into very high contention for the interconnection network, which indicates that more effort must be devoted to designing routers capable of handling high traffic levels.
{"title":"Evaluating kilo-instruction multiprocessors","authors":"M. Galluzzi, R. Beivide, Valentin Puente, J. Gregorio, A. Cristal, M. Valero","doi":"10.1145/1054943.1054953","DOIUrl":"https://doi.org/10.1145/1054943.1054953","url":null,"abstract":"The ever increasing gap in processor and memory speeds has a very negative impact on performance. One possible solution to overcome this problem is the Kilo-instruction processor. It is a recent proposed architecture able to hide large memory latencies by having thousands of in-flight instructions. Current multiprocessor systems also have to deal with this increasing memory latency while facing other sources of latencies: those coming from communication among processors. What we propose, in this paper, is the use of Kilo-instruction processors as computing nodes for small-scale CCNUMA multiprocessors. We evaluate what we appropriately call Kilo-instruction Multiprocessors. This kind of systems appears to achieve very good performance while showing two interesting behaviours. First, the great amount of in-flight instructions makes the system not just to hide the latencies coming from the memory accesses but also the inherent communication latencies involved in remote memory accesses. Second, the significant pressure imposed by many in-flight instructions translates into a very high contention for the interconnection network, what indicates us that more efforts need to be employed in designing routers capable of managing high traffic levels.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122634974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A low-power memory hierarchy for a fully programmable baseband processor
W. Raab, Hans-Martin Blüthgen, U. Ramacher
Future terminals for wireless communication must not only support multiple standards but also execute several of them concurrently. To meet these requirements, flexibility and ease of programming are increasingly important criteria for integrated circuits for digital baseband processing, while their power consumption and area remain as critical as in the past. This paper presents the architecture of a fully programmable system-on-chip for digital signal processing in the baseband of contemporary and upcoming wireless communication standards. Particular focus is given to the memory hierarchy of the multi-processor system and to the measures taken to minimize the power it dissipates. These measures are estimated to reduce the power consumption of the entire chip by 28% compared to a straightforward approach.
{"title":"A low-power memory hierarchy for a fully programmable baseband processor","authors":"W. Raab, Hans-Martin Blüthgen, U. Ramacher","doi":"10.1145/1054943.1054957","DOIUrl":"https://doi.org/10.1145/1054943.1054957","url":null,"abstract":"Future terminals for wireless communication not only must support multiple standards but execute several of them concurrently. To meet these requirements, flexibility and ease of programming of integrated circuits for digital baseband processing are increasingly important criteria for the deployment of such devices, while power consumption and area of the devices remain as critical as in the past.The paper presents the architecture of a fully programmable system-on-chip for digital signal processing in the baseband of contemporary and up-coming standards for wireless communication. Particular focus is given to the memory hierarchy of the multi-processor system and the measures to minimize the power it dissipates. The reduction of the power consumption of the entire chip is estimated to amount to 28% compared to a straightforward approach.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128683421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}