
Workshop on Memory Performance Issues: Latest Publications

Addressing mode driven low power data caches for embedded processors
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054961
R. Peri, John Fernando, R. Kolagotla
The size and speed of first-level caches and SRAMs of embedded processors continue to increase in response to demands for higher performance. In power-sensitive devices like PDAs and cellular handsets, decreasing power consumption while increasing performance is desirable. Contemporary caches typically exploit locality in memory access patterns but do not exploit locality information encoded in addressing modes used to access memory. We present two schemes that use locality information inherent in memory addressing modes to reduce power consumption of cache or SRAM nearest to the processor. The level-0 data buffer scheme introduces a set of data buffers controlled by the addressing mode to eliminate over a third of all reads to the next level of memory (cache or SRAM). These buffers can also reduce load-use penalty in processors with long load pipelines. The address register tag-buffer scheme exploits the addressing mode to reduce tag array look-up in set associative first-level caches.
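Below is a minimal sketch of the level-0 data buffer idea, assuming a simple model in which each address register gets one small buffer and post-increment accesses walk memory sequentially; the sizes, names, and hit/miss accounting are illustrative, not the authors' design.

```c
/* Minimal sketch of the level-0 data buffer scheme (not the authors'
 * design): one small buffer per address register. A post-increment
 * access pattern walks memory sequentially, so the block fetched on a
 * buffer miss services the following accesses without reading the
 * L1/SRAM array again. Sizes and names are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define NUM_AREGS 8    /* address registers, one L0 buffer each */
#define L0_BYTES  32   /* bytes buffered per address register   */

typedef struct { uint32_t base; int valid; } L0Buf;

static L0Buf l0[NUM_AREGS];
static long l1_reads, l0_hits;

/* Load through address register `areg` in post-increment mode. */
static void load_postinc(int areg, uint32_t addr)
{
    uint32_t blk = addr & ~(uint32_t)(L0_BYTES - 1);
    if (l0[areg].valid && l0[areg].base == blk) {
        l0_hits++;                /* served by L0: no L1/SRAM array read   */
    } else {
        l1_reads++;               /* fill the L0 buffer from the next level */
        l0[areg].base  = blk;
        l0[areg].valid = 1;
    }
}

int main(void)
{
    for (uint32_t a = 0; a < 4096; a += 4)   /* sequential 4-byte loads */
        load_postinc(0, a);
    printf("next-level reads: %ld, L0 hits: %ld\n", l1_reads, l0_hits);
    return 0;
}
```

On this purely sequential walk, 7 of every 8 loads are served from the level-0 buffer; real traces mix addressing modes, which is why the paper reports eliminating over a third of reads rather than the 87% this best case would suggest.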
Citations: 0
A study of performance impact of memory controller features in multi-processor server environment
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054954
C. Natarajan, Bruce Christenson, F. Briggs
With the growing imbalance between processor and memory performance it becomes more and more important to optimize the memory controller features to obtain the maximum possible performance out of the memory subsystem. This paper presents a study of the performance impact of several memory controller features in multi-processor (MP) server environments that use a DDR/DDR2 based memory subsystem. The results from our studies show that significant performance improvements can be obtained by carefully optimizing the memory controller features. For instance, one of our studies shows that in a system with an in-order shared bus connecting the CPUs and memory controller, an intelligent read-to-write switching memory controller feature can provide the same order of benefit as doubling the number of interleaved memory ranks. Another study shows that much lower average loaded read latency across a wider range of throughput can be obtained by a delayed write scheduling feature.
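As a rough illustration of read-to-write switching, the toy model below batches writes behind reads so that bus-direction turnarounds happen once per write burst rather than once per write; the queue depth and cost model are our assumptions, not the paper's parameters.

```c
/* Toy model of read-to-write switching (queue depth and cost model are
 * assumptions, not the paper's parameters): reads go to the bus at
 * once, writes are held back and drained as one burst, so bus-direction
 * turnarounds happen once per batch instead of once per write. */
#include <stdio.h>

#define WQ_HIGH 8    /* drain the write queue at this depth (illustrative) */

static int wq_len, turnarounds, last_was_read = 1;

static void bus_issue(int is_read)
{
    if (is_read != last_was_read)
        turnarounds++;           /* each direction switch wastes bus cycles */
    last_was_read = is_read;
}

static void request(int is_read)
{
    if (is_read) {
        bus_issue(1);            /* reads are latency-critical: go now */
        return;
    }
    if (++wq_len >= WQ_HIGH) {   /* writes wait, then drain as a burst */
        for (int i = 0; i < wq_len; i++)
            bus_issue(0);
        wq_len = 0;
    }
}

int main(void)
{
    for (int i = 0; i < 1000; i++)   /* alternating read/write stream */
        request(i & 1);
    printf("turnarounds with batching: %d (vs ~999 if writes issued inline)\n",
           turnarounds);
    return 0;
}
```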
Citations: 80
The Opie compiler from row-major source to Morton-ordered matrices
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054962
Steven T. Gabriel, David S. Wise
The Opie Project aims to develop a compiler to transform C codes written for row-major matrix representation into equivalent codes for Morton-order matrix representation, and to apply its techniques to other languages. Accepting a possible reduction in performance, we seek to compile a library of usable code to support future development of new algorithms better suited to Morton-ordered matrices. This paper reports the formalism behind the Opie compiler for C and its status: it now compiles several standard Level-2 and Level-3 linear algebra operations, and it demonstrates a breakthrough reflected in a huge reduction of L1, L2, and TLB misses. Overall performance improves on the Intel Xeon architecture.
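The core address transformation is standard bit interleaving (dilated integers): where row-major code computes base + (i*N + j), Morton-ordered code interleaves the bits of i and j. The sketch below shows that arithmetic; the dilate/morton helper names are ours, not the compiler's internals.

```c
/* Sketch of the address arithmetic a compiler like Opie must emit:
 * dilate() spreads the bits of a 16-bit index into the even bit
 * positions, so interleaving row and column bits gives the Morton
 * (Z-order) element index. */
#include <stdint.h>
#include <stdio.h>

static uint32_t dilate(uint32_t x)     /* 16-bit in, bits to even slots */
{
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

static uint32_t morton(uint32_t i, uint32_t j)
{
    return (dilate(i) << 1) | dilate(j);   /* interleave row/column bits */
}

int main(void)
{
    /* A 2x2 tile lands in 4 consecutive offsets, which is the locality
     * property the layout is after: */
    printf("%u %u %u %u\n", morton(0,0), morton(0,1), morton(1,0), morton(1,1));
    return 0;   /* prints: 0 1 2 3 */
}
```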
Citations: 11
Cache organizations for clustered microarchitectures
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054950
José González, Fernando Latorre, Antonio González
Clustered microarchitectures are an effective organization to deal with the problem of wire delays and complexity by partitioning some of the processor resources. The organization of the data cache is a key factor in these processors due to its effect on cache miss rate and inter-cluster communications. This paper investigates alternative designs of the data cache: centralized, distributed, replicated, and physically distributed cache architectures are analyzed. Results show similar average performance but significant performance variations depending on the application features, especially cache miss ratio and communications. In addition, we also propose a novel instruction steering scheme in order to reduce communications. This scheme conditionally stalls the dispatch of instructions depending on the occupancy of the clusters, whenever the current instruction cannot be steered to the cluster holding most of the inputs. This new steering outperforms traditional schemes. Results show an average speedup of 5% and up to 15% for some applications.
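The sketch below captures the conditional-stall steering heuristic as we read it: steer each instruction to the cluster holding most of its inputs, and stall dispatch when that cluster is too full rather than steer it elsewhere and pay a communication. The cluster count and occupancy threshold are illustrative, not the paper's values.

```c
/* Sketch of conditional-stall steering (thresholds illustrative): the
 * cluster with most of the instruction's inputs is preferred; if its
 * issue queue is too full, dispatch stalls instead of steering the
 * instruction away, which would force an inter-cluster communication. */
#define NCLUSTERS 4
#define OCC_LIMIT 12     /* illustrative issue-queue occupancy threshold */
#define STALL    (-1)

int steer(const int inputs_in[NCLUSTERS], const int occupancy[NCLUSTERS])
{
    int best = 0;
    for (int c = 1; c < NCLUSTERS; c++)   /* cluster with most inputs wins */
        if (inputs_in[c] > inputs_in[best])
            best = c;
    if (occupancy[best] >= OCC_LIMIT)
        return STALL;    /* hold the instruction instead of communicating */
    return best;
}
```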
Citations: 18
An analytical model for software-only main memory compression
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054958
I. Tuduce, T. Gross
Many applications with large data spaces that cannot run on a typical workstation (due to page faults) call for techniques to expand the effective memory size. One such technique is memory compression. Understanding what applications under what conditions can benefit from main memory compression is complicated due to various tradeoffs and the dynamic characteristics of applications. For instance, a large area to store compressed data increases the effective memory size considerably but also decreases the amount of memory that can hold uncompressed data. This paper presents an analytical model that states the conditions for a compressed-memory system to yield performance improvements. Parameters of the model are the compression algorithm efficiency, the amount of data being compressed, and the application memory access pattern. Such a model can be used by an operating system to compute the size of the compressed-memory level that can improve an application's performance.
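A back-of-envelope version of the tradeoff the model formalizes: devoting C bytes of an M-byte memory to compressed storage with compression ratio r (compressed/original, r < 1) gives an effective size of (M - C) + C/r, while leaving only M - C bytes for directly usable, uncompressed pages. The symbols and the sweep below are our illustration, not the paper's model.

```c
/* Back-of-envelope form of the compression tradeoff (symbols are ours,
 * not the paper's): a bigger compressed area raises effective capacity
 * but shrinks the region that can hold uncompressed, directly usable
 * pages. */
#include <stdio.h>

static double effective_size(double M, double C, double r)
{
    return (M - C) + C / r;   /* uncompressed region + expanded region */
}

int main(void)
{
    double M = 1024.0;                        /* MB of physical memory  */
    for (double C = 0; C <= 512; C += 128)    /* sweep compressed area  */
        printf("C=%4.0f MB -> effective %6.0f MB, uncompressed %4.0f MB\n",
               C, effective_size(M, C, 0.5), M - C);
    return 0;
}
```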
Citations: 6
A low cost, multithreaded processing-in-memory system
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054946
J. Brockman, Shyamkumar Thoziyoor, Shannon K. Kuntz, P. Kogge
This paper discusses die cost vs. performance tradeoffs for a PIM system that could serve as the memory system of a host processor. For an increase of less than twice the cost of a commodity DRAM part, it is possible to realize a performance speedup of nearly a factor of 4 on irregular applications. This cost efficiency derives from developing a custom multithreaded processor architecture and implementation style that is well-suited for embedding in a memory. Specifically, it takes advantage of the low latency and high row bandwidth both to simplify processor design --- reducing area --- and to improve processing throughput. To support our claims of cost and performance, we have used simulation and analysis of existing chips, and have also designed and fully implemented a prototype chip, PIM Lite.
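The latency-tolerance argument can be seen in a toy round-robin multithreading model: with enough thread contexts to cover the short on-die DRAM latency, a simple in-order core sustains roughly one memory operation per cycle with no caches or out-of-order logic. The thread count and latency below are illustrative, not PIM Lite's parameters.

```c
/* Toy model of multithreaded latency hiding in a PIM core (parameters
 * illustrative, not PIM Lite's): each cycle, a ready thread issues one
 * memory op and then waits out the short on-die access latency while
 * the other threads take their turns. */
#include <stdio.h>

#define NTHREADS 4
#define MEM_LAT  3    /* on-die row access latency, cycles (illustrative) */
#define OPS     16    /* memory ops per thread */

int main(void)
{
    int busy_until[NTHREADS] = {0}, done[NTHREADS] = {0};
    int cycles = 0, total = 0, last = NTHREADS - 1;

    while (total < NTHREADS * OPS) {
        for (int t = 0; t < NTHREADS; t++) {          /* round-robin pick */
            int u = (last + 1 + t) % NTHREADS;
            if (done[u] < OPS && cycles >= busy_until[u]) {
                done[u]++; total++;                   /* issue one op     */
                busy_until[u] = cycles + MEM_LAT;     /* thread now waits */
                last = u;
                break;
            }
        }
        cycles++;
    }
    printf("%d ops in %d cycles\n", total, cycles);   /* ~1 op per cycle */
    return 0;
}
```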
Citations: 39
A compressed memory hierarchy using an indirect index cache
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054945
Erik G. Hallnor, S. Reinhardt
The large and growing impact of memory hierarchies on overall system performance compels designers to investigate innovative techniques to improve memory-system efficiency. We propose and analyze a memory hierarchy that increases both the effective capacity of memory structures and the effective bandwidth of interconnects by storing and transmitting data in compressed form. Caches play a key role in hiding memory latencies. However, cache sizes are constrained by die area and cost. A cache's effective size can be increased by storing compressed data, if the storage unused by a compressed block can be allocated to other blocks. We use a modified Indirect Index Cache to allocate variable amounts of storage to different blocks, depending on their compressibility. By coupling our compressed cache design with a similarly compressed main memory, we can easily transfer data between these structures in a compressed state, increasing the effective memory bus bandwidth. This optimization further improves performance when bus bandwidth is critical. Our simulation results, using the SPEC CPU2000 benchmarks, show that our design increases performance by up to 225% on some benchmarks while degrading performance in general by no more than 2%, other than a 12% decrease on a single benchmark. Compressed bus transfers alone account for up to 80% of this improvement, with the remainder coming from increased effective cache capacity. As memory latencies increase, our design becomes even more beneficial.
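A minimal sketch of the indirection involved: instead of a fixed data slot per tag, each tag entry holds pointers to small segments in a shared data heap, so a well-compressed block occupies fewer segments and frees the rest for other blocks. The layout and field names below are our assumptions, not the actual Indirect Index Cache structures.

```c
/* Sketch of tag indirection for compressed blocks (layout and names are
 * assumptions, not the actual Indirect Index Cache): a tag entry points
 * to a variable number of small segments in a shared data heap, so a
 * block compressed to 2 segments leaves 2 free for another block. */
#include <stdint.h>
#include <stddef.h>

#define SEG_BYTES 16
#define SEGS_MAX   4             /* an uncompressed block = 4 segments */

typedef struct {
    uint64_t tag;
    uint8_t  nsegs;              /* 1..SEGS_MAX after compression      */
    uint16_t seg[SEGS_MAX];      /* indices into the shared data heap  */
    uint8_t  valid;
} TagEntry;

static uint8_t data_heap[1 << 12][SEG_BYTES];   /* shared segment pool */

/* Hit path: follow per-tag segment pointers instead of a fixed slot. */
const uint8_t *lookup(const TagEntry *e, uint64_t tag, int seg_no)
{
    if (!e->valid || e->tag != tag || seg_no >= e->nsegs)
        return NULL;             /* miss, or beyond the compressed size */
    return data_heap[e->seg[seg_no]];
}
```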
Citations: 46
SCIMA-SMP: on-chip memory processor architecture for SMP
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054960
C. Takahashi, Masaaki Kondo, T. Boku, D. Takahashi, Hiroshi Nakamura, M. Sato
In this paper, we propose a processor architecture with programmable on-chip memory for a high-performance SMP (symmetric multi-processor) node named SCIMA-SMP (Software Controlled Integrated Memory Architecture for SMP) with the intent of solving the performance gap problem between a processor and off-chip memory. With special instructions which enable the explicit data transfer between on-chip memory and off-chip memory, this architecture is able to control the data transfer timing and its granularity by the application program, and the SMP bus is utilized efficiently compared with traditional cache-only architecture. Through the performance evaluation based on clock-level simulation for various HPC applications, we confirmed that this architecture largely reduces the bus access cycle by avoiding redundant data transfer and controlling the granularity of the data movement between on-chip and off-chip memory.
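The programming style such explicit-transfer instructions enable looks roughly like software-controlled staging, sketched below with hypothetical scima_get/scima_put intrinsics (modeled here with memcpy so the pattern compiles and runs); the real instructions, their names, and their stride support belong to the paper and are not reproduced here.

```c
/* Sketch of explicit on-chip/off-chip staging. scima_get/scima_put are
 * hypothetical intrinsics standing in for the paper's special transfer
 * instructions, modeled with plain memcpy. */
#include <string.h>

#define ONCHIP_WORDS 1024
static double onchip[ONCHIP_WORDS];      /* stand-in for on-chip memory */

static void scima_get(double *on, const double *off, int n)
{
    memcpy(on, off, n * sizeof *on);     /* off-chip -> on-chip burst */
}

static void scima_put(double *off, const double *on, int n)
{
    memcpy(off, on, n * sizeof *on);     /* on-chip -> off-chip burst */
}

/* Software-controlled tiling: stage a tile in, compute, write it back.
 * The program, not the cache, picks transfer timing and granularity. */
void scale(double *a, int n, double k)
{
    for (int i = 0; i < n; i += ONCHIP_WORDS) {
        int len = (n - i < ONCHIP_WORDS) ? n - i : ONCHIP_WORDS;
        scima_get(onchip, a + i, len);
        for (int j = 0; j < len; j++)
            onchip[j] *= k;
        scima_put(a + i, onchip, len);
    }
}
```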
Citations: 0
A localizing directory coherence protocol
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054947
Collin McCurdy, C. Fischer
User-controllable coherence revives the idea of cooperation between software and hardware in an attempt to bridge the gap between efficient small-scale shared memory machines and massive distributed memory machines. It proposes a new multiprocessor architecture which has both a global address-space and multiple processor-local address-spaces with new memory instructions and a new coherence protocol to manage the dual address-spaces. The purpose of this paper is twofold. First, we solidify the semantics of instruction set extensions that enable "localization" -- the act of moving data from the global address-space to a processor's local address-space -- thus clearly defining the requirements for a localizing coherence protocol. Second, we demonstrate the feasibility of localizing coherence by describing the workings of a full-scale directory-based protocol that we have implemented and tested using an existing protocol specification tool.
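As a sketch of what localization must guarantee, the handler below moves a line into one processor's local address-space only after invalidating every other global copy recorded in the directory; the states, fields, and handler are illustrative and far simpler than the full protocol the authors specify.

```c
/* Sketch of a LOCALIZE directory handler (states and fields are
 * illustrative, not the paper's protocol): before a line enters
 * processor p's local address-space, every other global copy recorded
 * in the directory must be invalidated. */
enum state { G_SHARED, G_EXCLUSIVE, LOCALIZED };

typedef struct {
    enum state st;
    unsigned   sharers;    /* bitmask of processors caching the line */
    int        owner;      /* meaningful once LOCALIZED              */
} DirEntry;

static void invalidate(unsigned sharers, int requester)
{
    (void)sharers; (void)requester;   /* stub: send INV messages here */
}

/* Directory handler for a LOCALIZE request from processor p. */
void localize(DirEntry *d, int p)
{
    invalidate(d->sharers & ~(1u << p), p);  /* kill remote global copies */
    d->sharers = 1u << p;
    d->owner   = p;
    d->st      = LOCALIZED;   /* later global accesses must be redirected */
}
```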
Citations: 4
Scalable cache memory design for large-scale SMT architectures
Pub Date : 2004-06-20 DOI: 10.1145/1054943.1054952
M. Mudawar
The cache hierarchy design in existing SMT and superscalar processors is optimized for latency, but not for bandwidth. The size of the L1 data cache did not scale over the past decade. Instead, larger unified L2 and L3 caches were introduced. This cache hierarchy has a high overhead due to the principle of containment. It also has a complex design to maintain cache coherence across all levels. Furthermore, this cache hierarchy is not suitable for future large-scale SMT processors, which will demand high-bandwidth instruction and data caches with a large number of ports. This paper suggests the elimination of the cache hierarchy and replacing it with one-level caches for instruction and data. Multiple instruction caches can be used in parallel to scale the instruction fetch bandwidth and the overall cache capacity. A one-level data cache can be split into a number of block-interleaved cache banks to serve multiple memory requests in parallel. An interconnect is used to connect the data cache ports to the different cache banks, thus increasing the data cache access time. This paper shows that large-scale SMTs can tolerate long data cache hit times. It also shows that small line buffers can enhance the performance and reduce the required number of ports to the banked data cache memory.
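A sketch of the block interleaving this relies on: consecutive cache lines map to consecutive banks, so a group of independent accesses can proceed in parallel exactly when their lines fall in distinct banks. The bank count and line size below are illustrative.

```c
/* Sketch of block-interleaved banking (sizes illustrative): consecutive
 * cache lines map to consecutive banks, so independent requests usually
 * hit different banks and proceed in parallel; a bank conflict means
 * one request must wait. */
#include <stdint.h>

#define LINE_BYTES 64
#define NBANKS      8                   /* power of two */

static inline unsigned bank_of(uint64_t addr)
{
    return (addr / LINE_BYTES) % NBANKS;
}

/* Can a group of loads issue in the same cycle? Only if their lines
 * fall in distinct banks (one port per bank in this model). */
int conflict_free(const uint64_t addr[], int n)
{
    unsigned used = 0;
    for (int i = 0; i < n; i++) {
        unsigned b = bank_of(addr[i]);
        if (used & (1u << b))
            return 0;                   /* bank conflict */
        used |= 1u << b;
    }
    return 1;
}
```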
Citations: 6