
Latest publications from the ACM International Conference on Computing Frontiers

Reuse distance based performance modeling and workload mapping
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212936
Sai Prashanth Muralidhara, M. Kandemir, Orhan Kislal
Modern multicore architectures have multiple cores connected to a hierarchical cache structure, resulting in heterogeneity in cache sharing across different subsets of cores. In these systems, overall throughput and efficiency depend heavily on a careful mapping of applications to available cores. In this paper, we study the problem of application-to-core mapping with the goal of improving the overall cache performance in the presence of a hierarchical multi-level cache structure. We propose to sample the memory access patterns of individual applications and build their reuse distance distributions. Further, we propose to use these reuse distance distributions to compute an application-to-core mapping that tries to improve the overall cache performance and, consequently, the overall throughput. We show that our proposed mapping scheme is very effective in practice, yielding throughput benefits of about 39% over the worst-case mapping and about 30% over the default operating-system-based mapping. We believe that, as larger chip multiprocessors with deeper cache hierarchies are projected to be the norm in the future, efficient mapping of applications to cores will become a vital requirement to extract the maximum possible performance from these systems.
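As a rough illustration of the profiling step this abstract relies on, the sketch below builds a reuse distance histogram from a memory access trace. It is a minimal, unoptimized sketch under assumed inputs (a flat list of addresses), not the authors' sampling infrastructure.

```python
from collections import Counter

def reuse_distance_histogram(trace):
    """Compute a reuse distance histogram for a sequence of memory addresses.

    The reuse distance of an access is the number of *distinct* addresses
    touched since the previous access to the same address (infinite on the
    first access). This O(N*M) version only illustrates the idea.
    """
    histogram = Counter()
    last_access = {}          # address -> index of its previous access
    for i, addr in enumerate(trace):
        if addr in last_access:
            # distinct addresses touched strictly between the two accesses
            distance = len(set(trace[last_access[addr] + 1 : i]))
            histogram[distance] += 1
        else:
            histogram['inf'] += 1  # cold access: no previous reference
        last_access[addr] = i
    return histogram

# Toy trace: A and B are reused at distance 1, C at distance 0.
print(reuse_distance_histogram(['A', 'B', 'A', 'B', 'C', 'C']))
```

Distributions like this one, collected per application, could then be compared to decide which applications can share a cache with little interference and which should be kept apart.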
Citations: 3
How AI can change the way we play games
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212956
Kenneth O. Stanley
While artificial intelligence (AI) in games is often associated with enhancing the behavior of non-player characters, at its cutting edge AI offers the potential for entirely new kinds of gaming experiences. In this talk I will focus on this frontier of AI in games through three examples of games from my research that are not only enhanced by AI, but would not even be possible without the unique AI techniques behind them. In these experimental games, called NERO, Galactic Arms Race, and Petalz, players become teachers, AI creates its own content, and unique creations are explicitly bred and traded by the players themselves. The discussion will focus on the inspiration for the technologies behind these games (including some related applications) and the long-term implications of unique and creative AI algorithms for gaming.
Citations: 0
Towards truly integrated photonic and electronic computing
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212910
M. McLaren
The long-heralded transition of photonic technology from a rack-to-rack interconnect to an integral part of the system architecture is underway. Silicon photonics, where the optical communications devices are fabricated using the same materials and processes as CMOS logic, will allow 3D or monolithically integrated devices to be created, minimizing the overhead for moving between the electronic and photonic domains. System architects will then be free to exploit the unique characteristics of photonic communications such as broadband switching and distance independence. Photonic interconnects are very sensitive to the performance of connectors, and so may favor architectures where redundancy and reconfiguration are used in preference to replacement.
Citations: 0
DEEP: an exascale prototype architecture based on a flexible configuration
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212960
A. Bode
DEEP is a multi-partner international cooperation project supported by the EU FP7 that introduces a flexible global system architecture using general-purpose and manycore processor architectures (based on Intel MIC: Many Integrated Core architecture). With XTOLL, DEEP uses a very powerful interconnection structure, which allows for the arrangement of different application-oriented ratios between general-purpose processors and accelerators. The project includes research and development on programming technologies, tools, and applications, and looks at energy-efficient computing methodologies.
Citations: 0
BSArc: blacksmith streaming architecture for HPC accelerators
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212914
M. Shafiq, M. Pericàs, N. Navarro, E. Ayguadé
The current trend in high performance computing (HPC) systems is to deploy parallel computers equipped with general purpose multi-core processors and possibly multi-core streaming accelerators. However, the performance of these multi-cores is often constrained by the limited external bandwidth or by badly matching data access patterns. The latter reduces the size of useful data during memory transactions. A change in the application algorithm can improve the memory accesses but a hardware support mechanism for an application specific data arrangement in the memory hierarchy can significantly boost the performance for many application domains. In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end to efficiently distribute data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable application specific front-end. These design space explorations are carried out on a streaming architectural simulator that models BSArc. We evaluate the performance advantages for the BSArc design against a standard L2 cache in a GPU-like device. In our evaluations we use three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that employing an application specific arrangement of data on these kernels achieves an average speedup of 2.3× compared to a standard cache for a GPU-like streaming device.
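BSArc's application-specific data arrangement happens in hardware, but the underlying idea can be sketched in software: rearrange one operand once so that the compute loop streams both operands with unit stride. The matrix-multiplication example below is only an illustrative analogy under that assumption; the kernel and the layout choice are not the BSArc front-end.

```python
import numpy as np

def matmul_naive(A, B):
    """Inner loop walks B column-wise: a strided, cache-unfriendly access pattern."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]   # B accessed with stride n
    return C

def matmul_rearranged(A, B):
    """Same arithmetic, but B is re-laid-out (transposed) once so that both
    operands are streamed contiguously -- a software analogue of an
    application-specific data arrangement performed ahead of the kernel."""
    n = A.shape[0]
    Bt = np.ascontiguousarray(B.T)             # one-off data rearrangement
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i, k] * Bt[j, k]      # unit-stride accesses only
            C[i, j] = acc
    return C

A = np.random.rand(32, 32)
B = np.random.rand(32, 32)
assert np.allclose(matmul_naive(A, B), matmul_rearranged(A, B))
```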
Citations: 1
Game AI revisited
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212954
Georgios N. Yannakakis
More than a decade after the early research efforts on the use of artificial intelligence (AI) in computer games and the establishment of a new AI domain, the term "game AI" needs to be redefined. Traditionally, the tasks associated with game AI revolved around non-player character (NPC) behavior at different levels of control, varying from navigation and pathfinding to decision making. Commercial-standard games developed over the last 15 years and current game productions, however, suggest that the traditional challenges of game AI have been well addressed via the use of sophisticated AI approaches, not necessarily following or inspired by advances in academic practices. The marginal penetration of traditional academic game AI methods in industrial productions has been mainly due to the lack of constructive communication between academia and industry in the early days of academic game AI, and the inability of academic game AI to propose methods that would significantly advance existing development processes or provide scalable solutions to real-world problems. Recently, however, there has been a shift of research focus, as the current plethora of AI uses in games is breaking the non-player-character AI tradition. A number of those alternative AI uses have already shown significant potential for the design of better games. This paper presents four key game AI research areas that are currently reshaping the research roadmap in the game AI field and evidently put the term game AI under a new perspective. These game AI flagship research areas include the computational modeling of player experience, the procedural generation of content, the mining of player data on a massive scale, and alternative AI research foci for enhancing NPC capabilities.
Citations: 189
Concurrent hybrid switching for massively parallel systems-on-chip: the CYBER architecture
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212933
F. Palumbo, D. Pani, A. Congiu, L. Raffo
Massively parallel systems-on-chip represent the new frontier of integrated computing systems for general-purpose computing. The integration of a huge number of cores poses several issues, such as the efficiency and flexibility required of the interconnection network in order to best serve the different traffic patterns that can arise. In this paper we present the CYBER architecture, an advanced Network-on-Chip (NoC) for concurrent hybrid switching with prioritized best-effort Quality of Service. Compared to similar architectures, CYBER allows the simultaneous exploitation of packet switching and circuit switching, providing two different priorities to packets in order to be able to transmit urgent messages (e.g. signalling) while long-lasting transactions and heavy packet congestion are present. In terms of typical NoC metrics, evaluated on synthetic traffic representative of several application categories, the usual trends degrade while circuit and packet switching are served simultaneously, but the architecture preserves predictable behaviour. A 90nm CMOS implementation reveals a maximum operating frequency of about 1 GHz.
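The two packet priorities described above can be sketched as a small arbitration model: urgent traffic (e.g. signalling) always overtakes best-effort traffic at an output link, while arrival order is preserved within each class. This is a deliberately simplified software sketch; the class names and queue discipline are assumptions, and the circuit-switched path is not modelled.

```python
import heapq
from itertools import count

class PriorityLink:
    """Toy model of an output link that forwards urgent packets (priority 0)
    before best-effort packets (priority 1), FIFO within each class."""

    def __init__(self):
        self._queue = []
        self._arrival = count()   # tie-breaker preserving arrival order

    def inject(self, payload, priority):
        heapq.heappush(self._queue, (priority, next(self._arrival), payload))

    def forward(self):
        """Return the next packet the link would transmit, or None if idle."""
        if not self._queue:
            return None
        _, _, payload = heapq.heappop(self._queue)
        return payload

link = PriorityLink()
link.inject("bulk-0", priority=1)
link.inject("bulk-1", priority=1)
link.inject("interrupt", priority=0)   # urgent message overtakes bulk traffic
print([link.forward() for _ in range(3)])   # ['interrupt', 'bulk-0', 'bulk-1']
```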
Citations: 8
Mesh independent loop fusion for unstructured mesh applications
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212917
C. Bertolli, A. Betts, P. Kelly, G. Mudalige, M. Giles
Applications based on unstructured meshes are typically compute intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used for their acceleration, but these esoteric architectures require specialised knowledge to achieve optimal performance. OP2 is a parallel programming layer which attempts to ease this programming burden by allowing programmers to express parallel iterations over elements in the unstructured mesh through an API call, a so-called OP2-loop. The OP2 compiler infrastructure then uses source-to-source transformations to realise a parallel implementation of each OP2-loop and discover opportunities for optimisation. In this paper, we describe how several compiler techniques can be effectively utilised in tandem to increase the performance of unstructured mesh applications. In particular, we show how whole-program analysis, which is often inhibited by the size of the control flow graph, becomes feasible as a result of the OP2 programming model, facilitating aggressive optimisation. We subsequently show how whole-program analysis then becomes an enabler of OP2-loop optimisations. Based on this, we show how a classical technique, namely loop fusion, which is typically difficult to apply to unstructured mesh applications, can be defined at compile-time. We examine the limits of its application and show experimental results on a computational fluid dynamics application benchmark, assessing the performance gains due to loop fusion.
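The loop-fusion transformation the paper defines at compile time can be illustrated schematically: two OP2-style parallel loops over the same mesh set, with no intervening dependence, collapse into a single traversal that avoids a round-trip of intermediate data through memory. The sketch below uses plain Python over a flat array of cells; the kernels are invented for illustration and this is not the actual OP2 C/C++ API.

```python
import numpy as np

# Two OP2-style parallel loops over the same set of mesh cells.  Unfused,
# the intermediate array `scaled` is written to memory and read back.
def unfused(values, factor):
    scaled = np.empty_like(values)
    for i in range(len(values)):          # loop 1: scale each cell
        scaled[i] = values[i] * factor
    result = np.empty_like(values)
    for i in range(len(values)):          # loop 2: add a constant
        result[i] = scaled[i] + 1.0
    return result

# Fused version: one traversal of the mesh, no intermediate round-trip to
# memory -- the transformation a compiler can apply when both loops iterate
# over the same set and no other computation depends on the intermediate.
def fused(values, factor):
    result = np.empty_like(values)
    for i in range(len(values)):
        result[i] = values[i] * factor + 1.0
    return result

cells = np.random.rand(1000)
assert np.allclose(unfused(cells, 2.0), fused(cells, 2.0))
```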
Citations: 7
The boat hull model: enabling performance prediction for parallel computing prior to code development
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212937
C. Nugteren, H. Corporaal
Multi-core and many-core have been major trends for the past six years and are expected to continue for the next decade. With these trends in parallel computing, it becomes increasingly difficult to decide which processor to run a given application on, mainly because the programming of these processors has become increasingly challenging. In this work, we present a model to predict the performance of a given application on a multi-core or many-core processor. Since programming these processors can be challenging and time consuming, our model does not require source code to be available for the target processor. This is in contrast to existing performance prediction techniques such as mathematical models and simulators, which require code to be available and optimized for the target architecture. To enable performance prediction prior to algorithm implementation, we classify algorithms using an existing algorithm classification. For each class, we create a specific instance of the roofline model, resulting in a new class-specific model. This new model, named the boat hull model, enables performance prediction and processor selection prior to the development of architecture-specific code. We demonstrate the boat hull model using GPUs and CPUs as target architectures. We show that performance is accurately predicted for an example real-life application.
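The boat hull model specialises the roofline model per algorithm class; as a hedged sketch of the underlying prediction step, the snippet below applies the standard roofline bound to estimate a lower-bound execution time from an operation count and a byte count. The device parameters are made-up illustrative numbers, and the class-specific refinements of the boat hull model are not modelled here.

```python
def roofline_time(flops, bytes_moved, peak_gflops, peak_gbps):
    """Lower-bound execution time (seconds) from the roofline model:
    attainable GFLOP/s = min(peak compute, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved                      # FLOP per byte
    attainable = min(peak_gflops, peak_gbps * intensity)
    return flops / (attainable * 1e9)

# Hypothetical devices (numbers are illustrative, not vendor specifications).
gpu = dict(peak_gflops=1000.0, peak_gbps=150.0)
cpu = dict(peak_gflops=100.0,  peak_gbps=25.0)

# A memory-bound kernel: 0.25 FLOP per byte moved.
flops, bytes_moved = 1e9, 4e9
for name, dev in [("gpu", gpu), ("cpu", cpu)]:
    print(name, roofline_time(flops, bytes_moved, **dev), "s")
```

Comparing the predicted times across candidate devices is what allows processor selection before any architecture-specific code is written.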
Citations: 27
A hierarchical configuration system for a massively parallel neural hardware platform
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212934
F. Galluppi, Sergio Davies, Alexander D. Rast, T. Sharp, L. Plana, S. Furber
Simulation of large networks of neurons is a powerful and increasingly prominent methodology for investigating brain functions and structures. Dedicated parallel hardware is a natural candidate for simulating the dynamic activity of many non-linear units communicating asynchronously. It is only scientifically useful, however, if the simulation tools can be configured and run easily and quickly. We present a method to map network models to computational nodes on the SpiNNaker system, a programmable parallel neurally-inspired hardware architecture, by exploiting the hierarchies built into the model. This PArtitioning and Configuration MANager (PACMAN) system supports arbitrary network topologies and arbitrary membrane potential and synapse dynamics, and (most importantly) decouples the model from the device, allowing a variety of languages (PyNN, Nengo, etc.) to drive the simulation hardware. Model representation operates on a Population/Projection level rather than a single-neuron and connection level, exploiting hierarchical properties to lower the complexity of allocating resources and mapping the model onto the system. PACMAN can thus be used to generate structures coming from different models and front-ends, either with a host-based process or by parallelising it on the SpiNNaker machine itself to greatly speed up the generation process. We describe the approach with a first implementation of the framework used to configure the current generation of SpiNNaker machines and present results from a set of key benchmarks. The system allows researchers to exploit dedicated simulation hardware which may otherwise be difficult to program. In effect, PACMAN provides automated hardware acceleration for some commonly used network simulators while also pointing towards the advantages of hierarchical configuration for large, domain-specific hardware systems.
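A minimal sketch of the population-level partitioning idea: split each population into chunks no larger than a per-core neuron limit and assign each chunk to a core. The data structures and the limit of 256 neurons per core are assumptions for illustration; the real PACMAN also places projections, exploits the model hierarchy, and respects SpiNNaker-specific constraints.

```python
def partition_populations(populations, max_neurons_per_core=256):
    """Split each population into core-sized chunks and assign them to cores.

    `populations` maps a population name to its neuron count.  Returns a list
    of (core_id, population, first_neuron, last_neuron) placements.
    """
    placements = []
    core_id = 0
    for name, size in populations.items():
        start = 0
        while start < size:
            end = min(start + max_neurons_per_core, size)
            placements.append((core_id, name, start, end - 1))
            core_id += 1
            start = end
    return placements

# Toy model: an excitatory and an inhibitory population.
model = {"excitatory": 600, "inhibitory": 150}
for placement in partition_populations(model):
    print(placement)
# (0, 'excitatory', 0, 255), (1, 'excitatory', 256, 511),
# (2, 'excitatory', 512, 599), (3, 'inhibitory', 0, 149)
```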
Citations: 69