
Latest publications: Proceedings of the 2015 International Symposium on Memory Systems

High Performance Computing Co-Design Strategies
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818959
J. Ang
The MEMSYS Call for Papers contains this passage: Many of the problems we see in the memory system are cross-disciplinary in nature -- their solution would likely require work at all levels, from applications to circuits. Thus, while the scope of the problem is memory, the scope of the solutions will be much wider. The Department of Energy's (DOE) high performance computing (HPC) community is thinking about how to define, support and execute work at all levels for the development of future supercomputers to run our portfolio of mission applications. Borrowing a concept from embedded computing, the DOE HPC community is calling our work at all levels co-design [1]. Co-design for embedded computing is focused on hardware/software partitioning of activities to execute a well-defined task within specific constraints. Co-design for general-purpose HPC has many dimensions for both the work to be performed and the constraints, e.g. hardware designs, runtime software, applications and algorithms. The subject of this extended abstract is a description of two alternative DOE HPC co-design strategies. While DOE co-design efforts include more than the memory system, as noted in the MEMSYS call, the memory system impacts applications, circuits and all levels between.
Citations: 1
Anatomy of GPU Memory System for Multi-Application Execution
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818979
Adwait Jog, Onur Kayiran, Tuba Kesten, Ashutosh Pattnaik, Evgeny Bolotin, Niladrish Chatterjee, S. Keckler, M. Kandemir, C. Das
As GPUs make headway in the computing landscape spanning mobile platforms, supercomputers, cloud and virtual desktop platforms, supporting concurrent execution of multiple applications in GPUs becomes essential for unlocking their full potential. However, unlike CPUs, multi-application execution in GPUs is little explored. In this paper, we study the memory system of GPUs in a concurrently executing multi-application environment. We first present an analytical performance model for many-threaded architectures and show that the common use of misses-per-kilo-instruction (MPKI) as a proxy for performance is not accurate without considering the bandwidth usage of applications. We characterize the memory interference of applications and discuss the limitations of existing memory schedulers in mitigating this interference. We extend the analytical model to multiple applications and identify the key metrics to control various performance metrics. We conduct extensive simulations using an enhanced version of GPGPU-Sim targeted for concurrently executing multiple applications, and show that memory scheduling decisions based on MPKI and bandwidth information are more effective in enhancing throughput compared to the traditional FR-FCFS and the recently proposed RR FR-FCFS policies.
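The abstract's point that MPKI alone is a poor performance proxy without bandwidth information can be illustrated with a small sketch (the function names and numbers below are hypothetical, not from the paper): two kernels with identical MPKI can place very different demands on DRAM bandwidth.

```python
# Illustrative only: MPKI ignores how fast misses are generated, so two
# applications with equal MPKI can stress memory bandwidth very differently.

def mpki(cache_misses, instructions):
    """Misses per kilo-instruction."""
    return cache_misses / (instructions / 1000.0)

def bandwidth_demand(cache_misses, line_bytes, runtime_s):
    """Raw DRAM traffic generated by the misses, in bytes/second."""
    return cache_misses * line_bytes / runtime_s

# Two hypothetical kernels: same misses and instruction count (same MPKI),
# but one finishes 10x faster, so it demands 10x the bandwidth.
app_a = dict(misses=2_000_000, instrs=500_000_000, runtime=0.10)
app_b = dict(misses=2_000_000, instrs=500_000_000, runtime=0.01)

for name, app in (("A", app_a), ("B", app_b)):
    m = mpki(app["misses"], app["instrs"])
    bw = bandwidth_demand(app["misses"], 128, app["runtime"])
    print(f"app {name}: MPKI={m:.1f}, bandwidth demand={bw / 1e9:.2f} GB/s")
```

Under contention, a scheduler that sees only MPKI would treat these two kernels identically, which is the inaccuracy the paper's analytical model addresses.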
Citations: 82
Omitting Refresh: A Case Study for Commodity and Wide I/O DRAMs
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818964
Matthias Jung, Éder F. Zulian, Deepak M. Mathew, M. Herrmann, Christian Brugger, C. Weis, N. Wehn
Dynamic Random Access Memories (DRAM) have a large impact on performance and contribute significantly to the total power consumption of systems ranging from mobile devices to servers. Up to half of the power consumption of future high-density DRAM devices will be caused by refresh commands. Moreover, the refresh rate depends not only on the device capacity but also, strongly, on the temperature. In the case of 3D integration of MPSoCs with Wide I/O DRAMs, the power density and thermal dissipation increase dramatically; hence, 3D-DRAM requires even more refresh operations. To master these challenges, clever DRAM refresh strategies are needed at the hardware or software level, using new or already available infrastructures and implementations such as Partial Array Self Refresh (PASR) or Temperature Compensated Self Refresh (TCSR). In this paper, we show that for dedicated applications refresh can be disabled completely with no, or negligible, impact on application performance. This is possible if it is assured that either the lifetime of the data is shorter than the currently required DRAM refresh period, or the application can tolerate bit errors to some degree in a given time window.
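The two conditions under which the abstract says refresh may be omitted can be sketched as a decision rule (a hedged illustration, not the authors' mechanism; the parameter names and the pessimistic defaults are assumptions):

```python
# Illustrative decision rule for the two cases in the abstract:
# (1) the data dies (is overwritten or freed) before a refresh is due, or
# (2) the application tolerates the expected bit-error rate without refresh.

def can_omit_refresh(data_lifetime_ms, refresh_period_ms,
                     expected_ber=1.0, tolerable_ber=0.0):
    """Return True if refresh may be disabled for this buffer.

    data_lifetime_ms  -- how long the data must remain valid
    refresh_period_ms -- refresh period required at the current temperature
    expected_ber      -- predicted bit-error rate with refresh disabled
                         (defaults pessimistically to 1.0)
    tolerable_ber     -- bit-error rate the application can absorb
    """
    if data_lifetime_ms <= refresh_period_ms:
        return True                       # case (1): data dies before refresh is due
    return expected_ber <= tolerable_ber  # case (2): error-tolerant application

# e.g. a frame buffer rewritten every 16.7 ms against a 64 ms refresh period:
print(can_omit_refresh(16.7, 64.0))
```

Note that the refresh period itself shrinks at high temperature, which is why the temperature dependence the abstract mentions matters for this check.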
Citations: 41
k-Means Clustering on Two-Level Memory Systems
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818977
M. A. Bender, Jonathan W. Berry, S. Hammond, Branden J. Moore, Benjamin Moseley, C. Phillips
In recent work we quantified the anticipated performance boost when a sorting algorithm is modified to leverage user-addressable "near-memory," which we call scratchpad. This architectural feature is expected in the Intel Knights Landing processors that will be used in DOE's next large-scale supercomputer. This paper expands our analytical study of the scratchpad to consider k-means clustering, a classical data-analysis technique that is ubiquitous in the literature and in practice. We present new theoretical results using the model introduced in [13], which measures memory transfers and assumes that computations are memory-bound. Our theoretical results indicate that scratchpad-aware versions of k-means clustering can expect performance boosts for high-dimensional instances with relatively few cluster centers. These constraints may limit the practical impact of scratchpad for k-means acceleration, so we discuss their origins and practical implications. We corroborate our theory with experimental runs on a system instrumented to mimic one with scratchpad memory. We also contribute a semi-formalization of the computational properties that are necessary and sufficient to predict a performance boost from scratchpad-aware variants of algorithms. We have observed and studied these properties in the context of sorting, and now clustering. We conclude with some thoughts on the application of these properties to new areas. Specifically, we believe that dense linear algebra has similar properties to k-means, while sparse linear algebra and FFT computations are more similar to sorting.
Citations: 16
MMC: a Many-core Memory Connection Model
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818958
C. Ding, Hao Lu, Chencheng Ye
This extended abstract formulates a model of parallel performance called MMC. It gives the theoretical upper bound of parallel performance based on three factors: the processing capacity, the network capacity, and the memory capacity.
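The abstract names the three limiting factors but does not give the formula; one plausible roofline-style reading, offered purely as a hypothetical illustration, is that each capacity alone caps the achievable rate and the bound is their minimum:

```python
# Hypothetical illustration (not the paper's formula): a min-of-three
# bound in the spirit of roofline models, where each resource alone
# limits task throughput.

def throughput_bound(proc_rate, flops_per_task,
                     net_rate, net_bytes_per_task,
                     mem_rate, mem_bytes_per_task):
    """Upper bound on task throughput (tasks/second).

    proc_rate -- sustained compute rate (flops/s)   [processing capacity]
    net_rate  -- sustained interconnect rate (B/s)  [network capacity]
    mem_rate  -- sustained memory rate (B/s)        [memory capacity]
    """
    return min(proc_rate / flops_per_task,
               net_rate / net_bytes_per_task,
               mem_rate / mem_bytes_per_task)

# A task doing 1e6 flops, moving 1e5 B over the network and touching
# 1e6 B of memory: here the memory term is the binding constraint.
print(throughput_bound(1e12, 1e6, 1e11, 1e5, 1e11, 1e6))
```

Whichever term of the minimum binds identifies which of the three capacities the application is limited by.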
Citations: 0
Understanding Energy Aspects of Processing-near-Memory for HPC Workloads
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818985
Hyojong Kim, Hyesoon Kim, S. Yalamanchili, Arun Rodrigues
Interest in the concept of processing-near-memory (PNM) has been reignited by recent improvements in 3D integration technology. In this work, we analyze the energy consumption characteristics of a system that comprises a conventional processor and a 3D memory stack with fully-programmable cores. We construct a high-level analytical energy model based on the underlying architecture and the technology with which each component is built. From preliminary experiments with 11 HPC benchmarks from the Mantevo benchmark suite, we observed that misses per kilo-instruction (MPKI) at the last-level cache (LLC) is one of the most important characteristics in determining how friendly an application is to PNM execution.
Citations: 11
A Data Centric Perspective on Memory Placement
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818956
Y. Birk, O. Mencer
In this paper, we focus on memory in its role as a channel for passing information from one instruction to another; in particular, in conjunction with spatial or dataflow computing architectures, wherein the computing elements are laid out like an assembly plant. We point out the opportunity to dramatically increase effective data access bandwidth by moving from a centralized memory array model with a few ports to numerous tiny buffers that can be accessed concurrently. The penalty is a loss of access flexibility, but this flexibility is often a by-product of the memory organization rather than a true need. Improvements in hardware reconfiguration speed and resolution, combined with the definition of standard buffer queuing and routing capabilities and with efforts by tool designers and application developers, are likely to extend the applicability of these architectures, offering dramatic power-cost-performance advantages.
Citations: 2
Modeling Data Movement in the Memory Hierarchy in HPC Systems
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818972
Aditya M. Deshpande, J. Draper
Increasing core counts and cache sizes in modern processors are driving up data movement across the memory hierarchy. As High Performance Computing (HPC) systems become more and more energy constrained, improving energy efficiency is becoming a necessity. Given its significant impact on system energy efficiency, the cost of data movement in terms of energy and performance cannot be neglected. Conventional techniques for modeling and analyzing data movement across the memory hierarchy have proven inadequate in helping computer architects and system designers to optimize data movement. Our work is a position statement emphasizing the need for more detailed data movement modeling tools that better quantify how data movement across the memory hierarchy during application execution affects energy and performance. The hope is that exposing more detailed characteristics of the data movement would enable designers to optimize applications and architectures for minimal data movement, in turn reducing energy and perhaps even increasing performance.
Citations: 3
HpMC: An Energy-aware Management System of Multi-level Memory Architectures
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818974
Chun-Yi Su, D. Roberts, E. León, K. Cameron, B. D. Supinski, G. Loh, Dimitrios S. Nikolopoulos
DRAM technology faces density and power challenges in increasing capacity because of the limitations of physical cell design. To overcome these limitations, system designers are exploring alternative solutions that combine DRAM and emerging NVRAM technologies. Previous work on heterogeneous memories focuses mainly on two system designs: PCache, a hierarchical, inclusive memory system, and HRank, a flat, non-inclusive memory system. We demonstrate that neither of these designs can universally achieve high performance and energy efficiency across a suite of HPC workloads. In this work, we investigate the impact of a number of multi-level memory designs on the performance, power, and energy consumption of applications. To achieve this goal and overcome the limited number of available tools for studying heterogeneous memories, we created HMsim, an infrastructure that enables n-level, heterogeneous memory studies by leveraging existing memory simulators. We then propose HpMC, a new memory controller design that combines the best aspects of existing management policies to improve performance and energy. Our energy-aware memory management system dynamically switches between PCache and HRank based on the temporal locality of applications.
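The switching idea can be sketched as an epoch-based policy (a hypothetical illustration; the locality metric, threshold, and names below are assumptions, not HpMC's actual mechanism):

```python
# Hypothetical sketch of a locality-driven mode switch: prefer the
# hierarchical/inclusive mode (PCache) when recent accesses show high
# temporal locality, and the flat mode (HRank) otherwise.

def choose_mode(fast_hits, total_accesses, threshold=0.5):
    """Pick the memory organization for the next epoch.

    fast_hits      -- accesses served by the fast (DRAM) level this epoch
    total_accesses -- all memory accesses this epoch
    threshold      -- illustrative cutoff on the hit ratio
    """
    if total_accesses == 0:
        return "HRank"                    # no evidence of reuse yet
    locality = fast_hits / total_accesses
    return "PCache" if locality >= threshold else "HRank"

print(choose_mode(80, 100))   # high reuse: cache-like mode pays off
print(choose_mode(10, 100))   # streaming: flat mode avoids caching overhead
```

The intuition matches the abstract: an inclusive cache organization only earns its copy/fill overhead when data is reused, so low-locality phases are better served by the flat, non-inclusive layout.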
Citations: 27
Shared Last-Level Caches and The Case for Longer Timeslices
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818968
Viacheslav V. Fedorov, A. Reddy, Paul V. Gratz
Memory performance is important in modern systems. Contention at various levels of the memory hierarchy can lead to significant application performance degradation due to interference. Further, modern, large last-level caches (LLCs) have fill times greater than the OS scheduling window. When several threads are running concurrently and timesharing the CPU cores, they may never be able to load their working sets into the cache before being rescheduled, and thus remain permanently stuck in the "cold-start" regime. We show that by increasing the system scheduling timeslice length it is possible to amortize the cache cold-start penalty caused by multitasking and improve application performance by 10--15%.
{"title":"Shared Last-Level Caches and The Case for Longer Timeslices","authors":"Viacheslav V. Fedorov, A. Reddy, Paul V. Gratz","doi":"10.1145/2818950.2818968","DOIUrl":"https://doi.org/10.1145/2818950.2818968","url":null,"abstract":"Memory performance is important in modern systems. Contention at various levels in memory hierarchy can lead to significant application performance degradation due to interference. Further, modern, large, last-level caches (LLC) have fill times greater than the OS scheduling window. When several threads are running concurrently and timesharing the CPU cores, they may never be able to load their working sets into the cache before being rescheduled, thus permanently stuck in the \"cold-start\" regime. We show that by increasing the system scheduling timeslice length it is possible to amortize the cache cold-start penalty due to the multitasking and improve application performance by 10--15%.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126580304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2