
Latest publications: ACM International Conference on Computing Frontiers

Dynamic percolation: a case of study on the shortcomings of traditional optimization in many-core architectures
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212944
E. Garcia, Daniel A. Orozco, R. Khan, Ioannis E. Venetis, Kelly Livingston, G. Gao
This paper discusses the shortcomings of traditional static optimization techniques when they are used in the context of many-core architectures. We argue that these shortcomings result from the significantly different environment found in many-cores. We analyze previous attempts at optimizing Dense Matrix Multiplication (DMM) that failed to achieve high performance despite extensive optimization effort. We have found that percolation (prefetching data) and scheduling play a central role in application performance. To overcome these difficulties, we have (1) fused dynamic scheduling and percolation into a dynamic percolation approach and (2) added further percolation operations. Our new techniques increased the performance of the application in our study from 44 GFLOPS (out of a possible 80 GFLOPS) to 70.0 GFLOPS (operands in SRAM) or 65.6 GFLOPS (operands in DRAM).
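The fusion of dynamic scheduling with percolation can be pictured as a worker that, before computing the tile it has just dequeued, issues the prefetch for the next queued tile so the transfer overlaps with computation. The sketch below is a hypothetical, single-threaded Python illustration of that idea, not the authors' many-core implementation; `prefetch` and `compute` are placeholder callbacks.

```python
from collections import deque

def run_tiles(tiles, prefetch, compute):
    """Dynamically schedule tile tasks; before computing the current
    tile, 'percolate' (prefetch) the operands of the next queued tile,
    a crude stand-in for overlapping data movement with execution."""
    queue = deque(tiles)
    results = []
    prefetched = set()
    while queue:
        tile = queue.popleft()
        if queue:                      # issue next tile's prefetch early
            prefetch(queue[0])
            prefetched.add(queue[0])
        results.append(compute(tile))  # compute while transfer is in flight
    return results, prefetched
```

In a real runtime the prefetch would be an asynchronous copy into on-chip memory and the loop would run per hardware thread; here it only demonstrates the ordering.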
Citations: 14
Studying the impact of application-level optimizations on the power consumption of multi-core architectures
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212927
S. Rahman, Jichi Guo, Akshatha Bhat, Carlos D. Garcia, Majedul Haque Sujon, Qing Yi, C. Liao, D. Quinlan
This paper studies the overall system power variations of two multi-core machines, an 8-core Intel and a 32-core AMD workstation, while using them to execute a wide variety of sequential and multi-threaded benchmarks under varying compiler optimization settings and runtime configurations. Our extensive experimental study provides insights for answering two questions: 1) what degree of impact application-level optimizations can have on reducing the overall system power consumption of modern CMP architectures; and 2) what strategies compilers and application developers can adopt to achieve balanced performance and power efficiency for applications from a variety of science and embedded-systems domains.
Citations: 15
CRESTA: a software focussed approach to exascale co-design
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212958
Mark I. Parsons
The CRESTA project is one of three complementary exascale software projects funded by the European Commission. The three-year project employs a novel approach to exascale system co-design that focuses on using a small, representative set of applications to inform and guide software and systemware development. This methodology is designed to identify where problem areas exist in applications and to use that knowledge to consider different solutions to those problems, informing software and hardware advances. CRESTA uses a methodology of either incremental or disruptive advances to move towards solutions across the whole of the exascale software stack.
Citations: 0
DMA-circular: an enhanced high level programmable DMA controller for optimized management of on-chip local memories
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212925
Nikola Vujic, Lluc Alvarez, Marc González, X. Martorell, E. Ayguadé
This paper presents DMA-circular, a novel DMA controller that optimizes the management of on-chip local memories. DMA-circular embeds the functionality of caches into the DMA controller and applies aggressive optimizations using novel hardware. It anticipates the computation's data-transfer requirements and performs buffer management for data mapped to the local memory. The explicit hardware support accelerates the most common local-memory management actions, while the cache functionalities give DMA-circular a high level of programmability. The evaluation is done on several high-performance kernels from the NAS benchmark suite. Compared to traditional DMA controllers, results show speedups from 1.20x to 2x, keeping the control-code overhead under 15% of the kernels' execution time and reducing energy consumption by up to 40%.
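A circular-buffer DMA scheme of this general kind can be illustrated with plain double buffering: while chunk i is being computed on, the transfer of chunk i+1 is already in flight, so its latency hides behind the compute. The toy Python sketch below only mimics that overlap sequentially; the two-slot buffer count, chunk size and `compute` callback are illustrative assumptions, not details from the paper.

```python
def process_stream(data, chunk, compute):
    """Toy double-buffered (2-slot circular) pipeline: the slot holding
    chunk i is consumed while the other slot is 'DMA-filled' with
    chunk i+1, mirroring how the next transfer is started early."""
    n = (len(data) + chunk - 1) // chunk  # number of chunks
    buffers = [None, None]
    out = []
    buffers[0] = data[0:chunk]            # initial transfer cannot be hidden
    for i in range(n):
        nxt = (i + 1) % 2
        if i + 1 < n:                     # start the next transfer early
            buffers[nxt] = data[(i + 1) * chunk:(i + 2) * chunk]
        out.extend(compute(x) for x in buffers[i % 2])
    return out
```

On real hardware the early fill would be an asynchronous DMA request issued before the compute loop, with a completion wait before the buffer is consumed.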
Citations: 2
Towards player-driven procedural content generation
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212942
Noor Shaker, Georgios N. Yannakakis, J. Togelius
Generating immersive game content is one of the ultimate goals for a game designer. This goal can be achieved by recognizing that players' perception of the same game differs according to a number of factors, including players' personality, playing style, expertise and cultural background. While one player might find a game immersive, others may quit playing after encountering a seemingly insoluble problem. One promising avenue towards optimizing the gameplay experience for individual players is to tailor the experience in real time via automatic game content generation. Specifying the aspects of the game that have the major influence on the gameplay experience, identifying the relationship between these aspects and each individual experience, and defining a mechanism for tailoring the game content to each individual's needs are important steps towards player-driven content generation.
Citations: 20
Towards more intelligent adaptive video game agents: a computational intelligence perspective
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212955
S. Lucas, Philipp Rohlfshagen, Diego Perez Liebana
This paper provides a computational intelligence perspective on the design of intelligent video game agents. The paper explains why this is an interesting area to research, and outlines the most promising approaches to date, including evolution, temporal difference learning and Monte Carlo Tree Search. Strengths and weaknesses of each approach are identified, and some research directions are outlined that may soon lead to significantly improved video game agents with lower development costs.
Citations: 4
SuperCoP: a general, correct, and performance-efficient supervised memory system
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212922
Bharghava Rajaram, V. Nagarajan, Andrew J. McPherson, Marcelo H. Cintra
Supervised memory systems maintain additional metadata for each memory address accessed by the program, to control and monitor accesses to the program data. Supervised systems find use in several applications, including memory checking, synchronization, race detection, and transactional memory. Conventional memory instructions are replaced by supervised memory instructions (SMIs), which operate on both data and metadata atomically. Existing proposals for supervised memory systems assume sequential consistency. Recently, Bobba et al. [4] demonstrated the correctness issues (imprecise exceptions and metadata read reordering) in naively applying supervision to Total-Store-Order, and proposed two solutions - TSOall and TSOdata - for overcoming them. TSOall solves the correctness issues by forcing SMIs to perform in order, but performs similarly to SC, since supervised writes cannot retire into the write-buffer. TSOdata, while allowing supervised writes to retire into the write-buffer, works correctly for only a subset of supervision schemes. In this paper we observe that correctness is ensured as long as SMIs read and process their metadata in order. We propose SuperCoP, a supervised memory system for relaxed memory models in which SMIs read and process metadata before retirement, while allowing data and metadata writes to retire into the write-buffer. Since SuperCoP separates metadata reads and their processing from the writes, we propose a simple mechanism - cache-block-level locking at the directory - to ensure atomicity. Our experimental results show that SuperCoP performs better than TSOall by 16.8%, and better than TSOdata by 6%, even though TSOdata is not general.
Citations: 4
Exploring latency-power tradeoffs in deep nonvolatile memory hierarchies
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212923
D. Yoon, T. Gonzalez, Parthasarathy Ranganathan, R. Schreiber
To handle the demand for very large main memory, we are likely to use nonvolatile memory (NVM) as main memory. NVM main memory will have higher latency than DRAM. To cope with this, we advocate a less-deep cache hierarchy based on a large last-level NVM cache. We develop a model that estimates the average memory access time and power of a cache hierarchy. The model is based on captured application behavior, an analytical power and performance model, and circuit-level memory models such as CACTI and NVSim. We use the model to explore the cache hierarchy design space and present latency-power tradeoffs for memory-intensive SPEC benchmarks and scientific applications. The results indicate that a flattened hierarchy lowers power and improves average memory access time.
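A hierarchy model of this general shape can be folded from the last level upward with the standard recurrence AMAT_i = hit_i + miss_rate_i * AMAT_(i+1). The sketch below is a generic textbook average-memory-access-time calculation, not the authors' model, and every latency and miss-rate figure in the test is invented purely for illustration.

```python
def amat(levels, backing_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time, miss_rate) tuples ordered from L1 down
            to the last-level cache; backing_latency is the latency of
            the memory behind the last level (DRAM or NVM).
    Folds the recurrence AMAT_i = hit_i + miss_rate_i * AMAT_{i+1}
    from the bottom of the hierarchy up."""
    t = backing_latency
    for hit_time, miss_rate in reversed(levels):
        t = hit_time + miss_rate * t
    return t
```

With invented numbers, a deep three-level hierarchy might be evaluated as `amat([(1, 0.1), (5, 0.3), (20, 0.2)], 100)` and a flattened two-level one as `amat([(1, 0.1), (30, 0.05)], 150)`; which wins depends entirely on the assumed hit times and miss rates, which is exactly the design-space question the paper's model explores.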
Citations: 12
A reconfigurable optical/electrical interconnect architecture for large-scale clusters and datacenters
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212913
D. Lugones, K. Katrinis, M. Collier
Hybrid optical/electrical interconnects, using commercially available optical circuit switches at the core of the network, have recently been proposed as an attractive alternative to fully-connected electronically-switched networks in terms of port density, bandwidth per port, cabling and energy efficiency. Although the shift from a traditionally packet-switched core to switching between server aggregations (or servers) at circuit granularity requires system redesign, the approach has been shown to fit the traffic requirements of certain classes of high-performance computing applications, as well as the traffic patterns exhibited by typical data center workloads. Recent proposals for such system designs have looked at small- and medium-scale hybrid interconnects. In this paper, we present a hybrid optical/electrical interconnect architecture intended for large-scale deployments of high-performance computing systems and server co-locations. To reduce complexity, our architecture employs a regular shuffle network topology that allows for simple management and cabling. Thanks to a single-stage core interconnect and multiple optical planes, our design can be both incrementally scaled up (in capacity) and scaled out (in the number of racks) without requiring major re-cabling or network re-configuration. Also, to our knowledge, we are the first to explore the benefit of multi-hopping in the optical domain as a means to avoid constant reconfiguration of optical circuit switches. We have prototyped our architecture at packet-level detail in a simulation framework to evaluate this concept. Our results demonstrate that our hybrid interconnect, by adapting to the changing nature of application traffic, can significantly exceed the throughput of a static interconnect of equal degree, while at times attaining a throughput comparable to that of a costly fully-connected network.
We also show a further benefit of multi-hopping: it reduces performance drops by lowering the frequency of reconfiguration.
Citations: 16
Accelerated high-performance computing through efficient multi-process GPU resource sharing
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212950
Teng Li, Vikram K. Narayana, T. El-Ghazawi
The HPC field is witnessing widespread adoption of GPUs as accelerators for traditional homogeneous HPC systems. One of the prevalent parallel programming models is the SPMD paradigm, which has been adapted for GPU-based parallel processing. Since each process executes the same program under SPMD, every process mapped to a CPU core also needs access to the GPU. SPMD therefore demands a symmetric CPU/GPU distribution. However, since modern HPC systems feature far more CPU cores than GPUs, computing resources are generally underutilized under SPMD. Our previous efforts have focused on GPU virtualization, which enables efficient sharing of a GPU among multiple CPU processes. Nevertheless, a formal method to evaluate and choose the appropriate GPU sharing approach is still lacking. In this paper, based on SPMD GPU kernel profiles, we propose different multi-process GPU sharing scenarios under virtualization. We introduce an analytical model that captures these sharing scenarios and provides a theoretical performance-gain estimation. Benchmarks validate our analyses and the achievable performance gains. While our analytical study provides a suitable theoretical foundation for GPU sharing, the experimental results demonstrate that GPU virtualization affords significant performance improvements over non-virtualized solutions for all proposed sharing scenarios.
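A first-order version of such a sharing model can be written down directly: with exclusive access the GPU phases of the n SPMD processes serialize behind one another, while with ideal virtualized sharing each process's GPU phase hides behind the CPU phases of its peers until the GPU itself saturates. The Python sketch below is a hypothetical back-of-the-envelope model, not the analytical model from the paper; `t_cpu` and `t_gpu` are assumed per-iteration phase times for one process.

```python
def shared_gpu_time(t_cpu, t_gpu, n_procs):
    """Crude per-iteration time estimate for n SPMD processes and one GPU.

    exclusive: every process must wait its turn for the GPU, so the
               n GPU phases queue up behind one CPU phase.
    shared:    with ideal overlap, the iteration is bounded either by
               one process's own CPU+GPU path or by GPU saturation
               (the GPU must still execute all n kernels)."""
    exclusive = t_cpu + n_procs * t_gpu
    shared = max(t_cpu + t_gpu, n_procs * t_gpu)
    return exclusive, shared
```

For example, with a CPU phase of 8 time units, a GPU kernel of 2 units and 4 processes, the model predicts 16 units under exclusive access versus 10 under ideal sharing; real gains depend on kernel profiles, which is what the paper's model parameterizes.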
Citations: 6
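As a rough illustration of the kind of estimate an analytical GPU-sharing model like the one above produces (this is a toy sketch under my own assumptions, not the authors' actual model, and the function name is hypothetical): if a kernel profile shows that each SPMD process occupies the GPU for a fraction `gpu_fraction` of its runtime, the shared GPU saturates once combined demand reaches 1, which bounds the achievable speedup over fully serialized GPU access.

```python
def sharing_gain_bound(n_procs: int, gpu_fraction: float) -> float:
    """Upper bound on speedup when n_procs SPMD processes share one GPU.

    gpu_fraction: fraction of each process's runtime spent in GPU kernels
    (taken from a kernel profile). The GPU saturates once the combined
    demand n_procs * gpu_fraction reaches 1, capping the gain at
    1 / gpu_fraction.
    """
    if not 0.0 < gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be in (0, 1]")
    return min(float(n_procs), 1.0 / gpu_fraction)

# Example: 8 CPU processes, each spending 25% of its time on the GPU.
# The GPU saturates at 4 concurrent processes' worth of kernel work, so
# sharing can deliver at most 4x the throughput of one-at-a-time access.
print(sharing_gain_bound(8, 0.25))   # -> 4.0
print(sharing_gain_bound(2, 0.25))   # -> 2.0
```

The bound captures the intuition in the abstract: the more CPU cores outnumber GPUs and the smaller each process's GPU fraction, the larger the headroom that virtualized sharing can reclaim.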