
Latest publications from the 2015 International Conference on Parallel Architecture and Compilation (PACT)

BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models
Joo Hwan Lee, Jaewoong Sim, Hyesoon Kim
Parallel machine learning workloads have become prevalent in numerous application domains. Many of these workloads are iterative convergent, allowing different threads to compute in an asynchronous manner and relaxing certain read-after-write data dependencies to use stale values. While considerable effort has been devoted to reducing the communication latency between nodes by utilizing asynchronous parallelism, inefficient utilization of relaxed consistency models within a single node has caused parallel implementations to have low execution efficiency. The long latency and serialization caused by atomic operations have a significant impact on performance, and because data communication is not overlapped with the main computation, execution efficiency suffers. The inefficiency comes from moving data between where they are stored and where they are processed. In this work, we propose Bounded Staled Sync (BSSync), hardware support for the bounded staleness consistency model that adds simple logic layers to the memory hierarchy. BSSync overlaps long-latency atomic operations with the main computation, targeting iterative convergent machine learning workloads. Compared to previous work that allows staleness for read operations, BSSync utilizes staleness for write operations, allowing stale writes. We demonstrate the benefit of the proposed scheme for representative machine learning workloads. On average, our approach outperforms the baseline asynchronous parallel implementation by 1.33x.
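To make the bounded-staleness idea concrete, here is a minimal C++ sketch — a software analogue under assumed parameters (staleness bound `kStaleS`, a toy gradient), not the authors' BSSync hardware or code: each worker buffers its writes locally and merges them into the shared parameters only every S iterations, so other threads may observe values up to S iterations stale. BSSync's point is to push that merge into logic near memory so it overlaps with the main computation.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Software analogue of bounded-staleness updates with staleness bound S:
// each worker buffers its writes locally and flushes them to the shared
// parameters only every S iterations, so other workers may read values
// that are at most S iterations stale.  BSSync would perform the flush in
// logic near memory, overlapped with the next iteration's computation.
constexpr int kParams = 1024;
constexpr int kIters  = 10000;
constexpr int kStaleS = 8;                      // staleness bound S

std::vector<std::atomic<float>> g_params(kParams);

void worker(int tid) {
    std::vector<float> local(kParams, 0.0f);    // locally buffered (stale) writes
    for (int it = 0; it < kIters; ++it) {
        int i = (tid * 31 + it) % kParams;      // toy access pattern
        float w = g_params[i].load(std::memory_order_relaxed);  // may miss others' buffered writes
        local[i] += 0.01f * (1.0f - w);         // toy gradient step, buffered locally
        if (it % kStaleS == kStaleS - 1) {      // flush once every S iterations
            for (int j = 0; j < kParams; ++j) {
                if (local[j] == 0.0f) continue;
                float cur = g_params[j].load(std::memory_order_relaxed);
                // atomic read-modify-write merge of the buffered update
                while (!g_params[j].compare_exchange_weak(
                           cur, cur + local[j], std::memory_order_relaxed)) {}
                local[j] = 0.0f;
            }
        }
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker, t);
    for (auto& t : threads) t.join();
}
```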
{"title":"BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models","authors":"Joo Hwan Lee, Jaewoong Sim, Hyesoon Kim","doi":"10.1109/PACT.2015.42","DOIUrl":"https://doi.org/10.1109/PACT.2015.42","url":null,"abstract":"Parallel machine learning workloads have become prevalent in numerous application domains. Many of these workloads are iterative convergent, allowing different threads to compute in an asynchronous manner, relaxing certain read-after-write data dependencies to use stale values. While considerable effort has been devoted to reducing the communication latency between nodes by utilizing asynchronous parallelism, inefficient utilization of relaxed consistency models within a single node have caused parallel implementations to have low execution efficiency. The long latency and serialization caused by atomic operations have a significant impact on performance. The data communication is not overlapped with the main computation, which reduces execution efficiency. The inefficiency comes from the data movement between where they are stored and where they are processed. In this work, we propose Bounded Staled Sync (BSSync), a hardware support for the bounded staleness consistency model, which accompanies simple logic layers in the memory hierarchy. BSSync overlaps the long latency atomic operation with the main computation, targeting iterative convergent machine learning workloads. Compared to previous work that allows staleness for read operations, BSSync utilizes staleness for write operations, allowing stale-writes. We demonstrate the benefit of the proposed scheme for representative machine learning workloads. On average, our approach outperforms the baseline asynchronous parallel implementation by 1.33x times.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121176697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 59
An Optimization of Resource Arrangement for Network-on-Chip using Genetic Algorithm
Daichi Murakami, K. Hiraki
A non-uniform arrangement of on-chip resources can outperform a uniform design, but the question of how to arrange the resources has not yet been well researched. In this research, we propose an automatic method for arranging on-chip resources using a genetic algorithm (GA).
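As a rough illustration of how a GA can search for such an arrangement, the sketch below encodes a placement as a permutation of mesh tiles and minimizes a traffic-weighted Manhattan-distance cost. The encoding, operators, and cost model are assumptions for illustration, not the authors' formulation.

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical GA for placing kN resources on a 4x4 mesh NoC.
// A chromosome is a permutation: gene i = the mesh tile holding resource i.
// Fitness is a traffic-weighted Manhattan-distance cost (lower is better).
constexpr int kN = 16;
constexpr int kMeshW = 4;
using Chromosome = std::vector<int>;

double cost(const Chromosome& place, const std::vector<std::vector<double>>& traffic) {
    double c = 0.0;
    for (int a = 0; a < kN; ++a)
        for (int b = 0; b < kN; ++b) {
            int xa = place[a] % kMeshW, ya = place[a] / kMeshW;
            int xb = place[b] % kMeshW, yb = place[b] / kMeshW;
            c += traffic[a][b] * (std::abs(xa - xb) + std::abs(ya - yb));
        }
    return c;
}

Chromosome optimize(const std::vector<std::vector<double>>& traffic,
                    int pop_size = 64, int generations = 500) {
    std::mt19937 rng(42);
    std::vector<Chromosome> pop(pop_size);
    for (auto& c : pop) {                               // random initial placements
        c.resize(kN);
        std::iota(c.begin(), c.end(), 0);
        std::shuffle(c.begin(), c.end(), rng);
    }
    std::uniform_int_distribution<int> pick(0, pop_size - 1), gene(0, kN - 1);
    for (int g = 0; g < generations; ++g) {
        // Tournament selection plus swap mutation; crossover is omitted so
        // every individual stays a valid permutation.
        std::vector<Chromosome> next;
        while (static_cast<int>(next.size()) < pop_size) {
            const Chromosome& a = pop[pick(rng)];
            const Chromosome& b = pop[pick(rng)];
            Chromosome child = (cost(a, traffic) < cost(b, traffic)) ? a : b;
            std::swap(child[gene(rng)], child[gene(rng)]);   // mutate: swap two tiles
            next.push_back(std::move(child));
        }
        pop = std::move(next);
    }
    return *std::min_element(pop.begin(), pop.end(),
        [&](const Chromosome& x, const Chromosome& y) { return cost(x, traffic) < cost(y, traffic); });
}

int main() {
    std::vector<std::vector<double>> traffic(kN, std::vector<double>(kN, 0.0));
    for (int i = 0; i < kN; ++i) traffic[i][(i + 1) % kN] = 1.0;   // toy ring traffic
    Chromosome best = optimize(traffic);
    std::printf("best placement cost: %.1f\n", cost(best, traffic));
}
```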
{"title":"An Optimization of Resource Arrangement for Network-on-Chip using Genetic Algorithm","authors":"Daichi Murakami, K. Hiraki","doi":"10.1109/PACT.2015.54","DOIUrl":"https://doi.org/10.1109/PACT.2015.54","url":null,"abstract":"Non-uniform arrangement of on-chip resources could outperform the uniform design. The way of arrangement itself, however, has not been well researched yet. In this research, we proposed an automatic method for better arrangement of the on-chip resources using GA.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129059990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Compiler Assisted Load Balancing on Large Clusters
Vinita V. Deodhar, Hrushit Parikh, Ada Gavrilovska, S. Pande
Load balancing of tasks across processing nodes is critical for achieving speedup on large-scale clusters. Load balancing schemes typically detect an imbalance and then migrate load from an overloaded processing node to an idle or lightly loaded one; thus, load estimation critically affects the performance of load balancing schemes. On large-scale clusters, the latency (and energy) of load migration between processing nodes is also a significant overhead, and any missteps in load estimation can cause substantial migrations and performance losses. Currently, load estimation is done by profile- or feedback-driven approaches in sophisticated systems such as Charm++, but such approaches must be rethought for workloads such as Adaptive Mesh Refinement (AMR) and multiscale physics, where load variations can be quite dynamic and rapid. In this work we propose a compiler-based framework that precisely predicts the forthcoming workload. The compiler-driven load prediction technique statically analyzes a task, derives an expression that predicts the task's load, and hoists it as early as possible in the control flow of execution. The compiler also inserts corrector expressions at strategic program points that refine both the reachability probability of the load and its estimate. The predictor and corrector expressions are evaluated at runtime; the predicted load information is refined as execution proceeds and is eventually used by the load balancer to make efficient migration decisions. We present an implementation of the above in the Rose compiler and the Charm++ parallel programming framework. We demonstrate the effectiveness of the framework on key benchmarks that exhibit dynamic variations and show how the compiler framework assists load balancing schemes in Charm++ to provide significant gains.
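The sketch below illustrates, in hand-written C++, the shape of the code such a framework would generate: a predictor expression hoisted ahead of the task body and a corrector evaluated once a branch outcome is known. The function names, the linear cost model, and the `report_load_estimate` hook are illustrative assumptions, not the Rose/Charm++ implementation described above.

```cpp
#include <cstddef>
#include <vector>

// Hand-written sketch of the kind of code the compiler framework would emit:
// a load-predictor expression hoisted ahead of the task body, plus a
// corrector evaluated once a branch outcome is known.

// The original task: its cost grows linearly with the input size, and an
// optional refinement pass roughly doubles the work.
void task(const std::vector<double>& data, bool refine) {
    double acc = 0.0;
    for (double v : data) acc += v;                    // ~n units of work
    if (refine) for (double v : data) acc += v * v;    // ~n more units
    (void)acc;
}

// Predictor derived by static analysis of task(), hoisted as early as possible.
inline double predicted_load(std::size_t n) {
    return static_cast<double>(n);                     // one unit per element
}

// Corrector inserted at the program point where `refine` becomes known.
inline double corrected_load(double predicted, bool refine) {
    return refine ? 2.0 * predicted : predicted;
}

void run(const std::vector<double>& data, bool refine) {
    double est = predicted_load(data.size());          // early prediction
    est = corrected_load(est, refine);                 // runtime refinement
    // report_load_estimate(est);   // hypothetical hook feeding the load balancer
    (void)est;
    task(data, refine);
}

int main() {
    std::vector<double> data(1 << 20, 1.0);
    run(data, /*refine=*/true);
}
```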
{"title":"Compiler Assisted Load Balancing on Large Clusters","authors":"Vinita V. Deodhar, Hrushit Parikh, Ada Gavrilovska, S. Pande","doi":"10.1109/PACT.2015.40","DOIUrl":"https://doi.org/10.1109/PACT.2015.40","url":null,"abstract":"Load balancing of tasks across processing nodes is critical for achieving speed up on large scale clusters. Load balancing schemes typically detect the imbalance and then migrate the load from an overloaded processing node to an idle or lightly loaded processing node and thus, estimation of load critically affects the performance of load balancing schemes. On large scale clusters, the latency of load migration between processing nodes (and the energy) is also a significant overhead and any missteps in load estimation can cause significant migrations and performance losses. Currently, the load estimation is done either by profile or feedback driven approaches in sophisticated systems such as Charm++, but such approaches must be re-thought in light of some workloads such as Adaptive Mesh Refinement (AMR) and multiscale physics where the load variations could be quite dynamic and rapid. In this work we propose a compiler based framework which performs precise prediction of the forthcoming workload. The compiler driven load prediction technique performs static analysis of a task and derives an expression to predict load of a task and hoists it as early as possible in the control flow of execution. The compiler also inserts corrector expressions at strategic program points which refine the reachability probability of the load as well as its estimation. The predictor and the corrector expressions are evaluated at runtime and the predicted load information is refined as the execution proceeds and is eventually used by load balancer to take efficient migration decisions. We present an implementation of the above in the Rose compiler and the Charm++ parallel programming framework. We demonstrate the effectiveness of the framework on some key benchmarks that exhibit dynamic variations and show how the compiler framework assists load balancing schemes in Charm++ to provide significant gains.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115568154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Runtime-Guided Management of Scratchpad Memories in Multicore Architectures
Lluc Alvarez, Miquel Moretó, Marc Casas, Emilio Castillo, X. Martorell, Jesús Labarta, E. Ayguadé, M. Valero
The increasing number of cores and the anticipated level of heterogeneity in upcoming multicore architectures cause important problems for traditional cache hierarchies. A good way to alleviate these problems is to add scratchpad memories alongside the cache hierarchy, forming a hybrid memory hierarchy. This memory organization has the potential to improve performance and to reduce power consumption and on-chip network traffic, but exposing such a complex memory model to the programmer has a very negative impact on the programmability of the architecture. Emerging task-based programming models are a promising alternative for programming heterogeneous multicore architectures. In these models the runtime system manages the execution of tasks on the architecture, allowing it to apply many optimizations in a generic way at the runtime-system level. This paper proposes giving the runtime system the responsibility of managing the scratchpad memories of a hybrid memory hierarchy in multicore processors, transparently to the programmer. In the envisioned system, the runtime system takes advantage of the information found in the task dependences to map the inputs and outputs of a task to the scratchpad memory of the core that is going to execute it. In addition, the paper exploits two mechanisms to overlap data transfers with computation and a locality-aware scheduler to reduce data motion. In a 32-core multicore architecture, the hybrid memory hierarchy outperforms cache-only hierarchies by up to 16%, reduces on-chip network traffic by up to 31%, and saves up to 22% of the consumed power.
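A software analogue of the runtime's role is sketched below: given the declared inputs and outputs of a task, stage them into a per-core scratchpad buffer, run the task against the scratchpad copies, and write back only the outputs. The `Scratchpad` and `TaskArg` types are hypothetical; the real system does this with hardware support in a hybrid hierarchy and overlaps the transfers with other tasks.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical software analogue of the runtime's job: use the declared
// inputs/outputs of a task to stage data into a per-core scratchpad before
// execution and write the outputs back afterwards, keeping the task body
// oblivious to where its operands actually live.
struct Scratchpad {
    std::vector<char> mem;
    explicit Scratchpad(std::size_t bytes) : mem(bytes) {}
};

struct TaskArg {
    void*       host;       // location in the normal (cached) hierarchy
    std::size_t bytes;
    bool        is_output;  // taken from the task's dependence annotations
};

void run_task_on_core(Scratchpad& spm, std::vector<TaskArg> args,
                      const std::function<void(const std::vector<void*>&)>& body) {
    std::vector<void*> spm_ptrs;
    std::size_t offset = 0;
    // Stage every argument into the scratchpad; in the real system these
    // transfers are overlapped with the execution of earlier tasks.
    for (const TaskArg& a : args) {
        void* dst = spm.mem.data() + offset;
        std::memcpy(dst, a.host, a.bytes);
        spm_ptrs.push_back(dst);
        offset += a.bytes;
    }
    body(spm_ptrs);                       // the task computes out of the scratchpad
    // Write back only the outputs named in the task's dependences.
    offset = 0;
    for (const TaskArg& a : args) {
        if (a.is_output) std::memcpy(a.host, spm.mem.data() + offset, a.bytes);
        offset += a.bytes;
    }
}

int main() {
    Scratchpad spm(1 << 16);              // 64 KiB per-core scratchpad (assumed size)
    std::vector<double> in(512, 1.0), out(512, 0.0);
    run_task_on_core(spm,
        {{in.data(), in.size() * sizeof(double), false},
         {out.data(), out.size() * sizeof(double), true}},
        [](const std::vector<void*>& p) {
            const double* a = static_cast<const double*>(p[0]);
            double*       b = static_cast<double*>(p[1]);
            for (std::size_t i = 0; i < 512; ++i) b[i] = 2.0 * a[i];
        });
}
```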
{"title":"Runtime-Guided Management of Scratchpad Memories in Multicore Architectures","authors":"Lluc Alvarez, Miquel Moretó, Marc Casas, Emilio Castillo, X. Martorell, Jesús Labarta, E. Ayguadé, M. Valero","doi":"10.1109/PACT.2015.26","DOIUrl":"https://doi.org/10.1109/PACT.2015.26","url":null,"abstract":"The increasing number of cores and the anticipated level of heterogeneity in upcoming multicore architectures cause important problems in traditional cache hierarchies. A good way to alleviate these problems is to add scratchpad memories alongside the cache hierarchy, forming a hybrid memory hierarchy. This memory organization has the potential to improve performance and to reduce the power consumption and the on-chip network traffic, but exposing such a complex memory model to the programmer has a very negative impact on the programmability of the architecture. Emerging task-based programming models are a promising alternative to program heterogeneous multicore architectures. In these models the runtime system manages the execution of the tasks on the architecture, allowing them to apply many optimizations in a generic way at the runtime system level. This paper proposes giving the runtime system the responsibility to manage the scratchpad memories of a hybrid memory hierarchy in multicore processors, transparently to the programmer. In the envisioned system, the runtime system takes advantage of the information found in the task dependences to map the inputs and outputs of a task to the scratchpad memory of the core that is going to execute it. In addition, the paper exploits two mechanisms to overlap the data transfers with computation and a locality-aware scheduler to reduce the data motion. In a 32-core multicore architecture, the hybrid memory hierarchy outperforms cache-only hierarchies by up to 16%, reduces on-chip network traffic by up to 31% and saves up to 22% of the consumed power.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131360426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Evaluating the Cost of Atomic Operations on Modern Architectures
Hermann Schweizer, Maciej Besta, T. Hoefler
Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, the performance tradeoffs of these operations with respect to various system characteristics, such as the structure of the caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for the latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even though they are characterized by different consensus numbers. Another insight is that the design of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the performance issues discovered in the analyzed architectures. Our analysis can be used to make better design and algorithmic decisions in parallel programming on the various architectures deployed in both off-the-shelf machines and large compute systems.
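The measurement pattern behind such benchmarks is simple; the single-threaded C++ sketch below times FAA (`fetch_add`) against a CAS loop on one atomic counter. It is only the skeleton of a latency experiment — the paper's methodology additionally controls the coherence state of the accessed cache line and uses multiple threads.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>

// Minimal single-thread latency sketch for FAA vs. a CAS loop on one
// atomic counter.  Real measurements must also control the coherence
// state of the line and exercise contention from multiple threads.
int main() {
    constexpr long kOps = 10'000'000;
    std::atomic<long> counter{0};

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < kOps; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);         // FAA
    auto t1 = std::chrono::steady_clock::now();

    for (long i = 0; i < kOps; ++i) {
        long expected = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(expected, expected + 1,
                                              std::memory_order_relaxed)) {}
    }                                                            // CAS loop
    auto t2 = std::chrono::steady_clock::now();

    auto ns = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count();
    };
    std::printf("FAA: %.1f ns/op  CAS: %.1f ns/op\n",
                double(ns(t0, t1)) / kOps, double(ns(t1, t2)) / kOps);
}
```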
{"title":"Evaluating the Cost of Atomic Operations on Modern Architectures","authors":"Hermann Schweizer, Maciej Besta, T. Hoefler","doi":"10.1109/PACT.2015.24","DOIUrl":"https://doi.org/10.1109/PACT.2015.24","url":null,"abstract":"Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these operations and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the design of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis can be used for making better design and algorithmic decisions in parallel programming on various architectures deployed in both off-the-shelf machines and large compute systems.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126290260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 87
Vector Parallelism in JavaScript: Language and Compiler Support for SIMD
Ivan Jibaja, P. Jensen, Ningxin Hu, M. Haghighat, J. McCutchan, D. Gohman, S. Blackburn, K. McKinley
JavaScript is the most widely used web programming language and is increasingly used to implement sophisticated and demanding applications in such domains as graphics, games, video, and cryptography. The performance and energy usage of these applications can benefit from hardware parallelism, including SIMD (Single Instruction, Multiple Data) vector parallel instructions. JavaScript's current support for parallelism is, however, limited and does not directly utilize SIMD capabilities. This paper presents the design and implementation of SIMD language extensions and compiler support that together add fine-grain vector parallelism to JavaScript. The specification for this language extension is in the final stages of adoption by the JavaScript standardization committee, and our compiler support is available in two open-source production browsers. The design principles seek portability, SIMD performance portability across various SIMD architectures, and compiler simplicity to ease adoption. The design does not require automatic vectorization compiler technology, but does not preclude it either. The SIMD extensions define immutable fixed-length SIMD data types and operations that are common to both the ARM and x86 ISAs. The contributions of this work include type speculation and optimizations that generate minimal numbers of native SIMD instructions from high-level JavaScript SIMD instructions. We implement type speculation, optimizations, and code generation in two open-source JavaScript VMs and measure performance improvements of 1.7× to 8.9× (3.3× on average) and average energy improvements of 2.9× on micro-benchmarks and key graphics kernels across various hardware, browsers, and operating systems. These portable SIMD language extensions significantly improve compute-intensive interactive applications in the browser, such as games and media processing, by exploiting vector parallelism without relying on automatic vectorizing compiler technology, non-portable native code, or plugins.
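The fixed-length vector types at the core of the extension correspond to 4-lane registers such as SSE's `__m128` on x86. The C++/SSE sketch below (x86 intrinsics, not the JavaScript API itself) shows the kind of element-wise operation such an extension exposes: one instruction operates on four floats at once.

```cpp
#include <immintrin.h>   // x86 SSE intrinsics
#include <cstdio>

// A 4-lane float operation analogous to what a float32x4 SIMD type exposes:
// load four floats, multiply-add them element-wise, and store the result.
int main() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float c[4];

    __m128 va = _mm_load_ps(a);                        // load 4 floats
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(_mm_mul_ps(va, vb), va);    // c = a*b + a, 4 lanes at once
    _mm_store_ps(c, vc);

    std::printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
}
```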
{"title":"Vector Parallelism in JavaScript: Language and Compiler Support for SIMD","authors":"Ivan Jibaja, P. Jensen, Ningxin Hu, M. Haghighat, J. McCutchan, D. Gohman, S. Blackburn, K. McKinley","doi":"10.1109/PACT.2015.33","DOIUrl":"https://doi.org/10.1109/PACT.2015.33","url":null,"abstract":"JavaScript is the most widely used web programming language and is increasingly used to implement sophisticated and demanding applications in such domains as graphics, games, video, and cryptography. The performance and energy usage of these applications can benefit from hardware parallelism, including SIMD (Single Instruction, Multiple Data) vector parallel instructions. JavaScript's current support for parallelism is, however, limited and does not directly utilize SIMD capabilities. This paper presents the design and implementation of SIMD language extensions and compiler support that together add fine-grain vector parallelism to JavaScript. The specification for this language extension is in final stages of adoption by the JavaScript standardization committee and our compiler support is available in two open-source production browsers. The design principles seek portability, SIMD performance portability on various SIMD architectures, and compiler simplicity to ease adoption. The design does not require automatic vectorization compiler technology, but does not preclude it either. The SIMD extensions define immutable fixed-length SIMD data types and operations that are common to both ARM and x86 ISAs. The contributions of this work include type speculation and optimizations that generate minimal numbers of SIMD native instructions from high-level JavaScript SIMD instructions. We implement type speculation, optimizations, and code generation in two open-source JavaScript VMs and measure performance improvements between a factor of 1.7× to 8.9× with an average of 3.3× and average energy improvements of 2.9× on micro benchmarks and key graphics kernels on various hardware, browsers, and operating systems. These portable SIMD language extensions significantly improve compute-intensive interactive applications in the browser, such as games and media processing, by exploiting vector parallelism without relying on automatic vectorizing compiler technology, non-portable native code, or plugins.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131062174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
Communication Avoiding Algorithms: Analysis and Code Generation for Parallel Systems
K. Murthy, J. Mellor-Crummey
Data movement is a critical bottleneck for future generations of parallel systems. The class of .5D communication-avoiding algorithms was developed to address this bottleneck. These algorithms reduce communication and provide strong scaling in both time and energy. As a first step towards automating the development of communication-avoiding libraries, we developed the Maunam compiler. Maunam generates efficient parallel code from a high-level, global-view sketch of .5D algorithms that are expressed using symbolic data sizes and numbers of processors. It supports the expression of data movement and communication through high-level global operations such as TILT and CSHIFT as well as through element-wise copy operations. With the latter, wrap-around communication patterns can also be achieved using subscripts based on modulo operations. Maunam employs polyhedral analysis to reason about the communication and computation present in a .5D algorithm. After partitioning data and computation, it inserts point-to-point and collective communication as needed. Maunam also analyzes data dependence patterns and data layouts to identify reductions over processor subsets. Maunam-generated Fortran+MPI code for 2.5D matrix multiplication running on 4096 cores of a Cray XC30 supercomputer achieves 59 TFlop/s (76% of the machine peak). Our generated parallel code achieves 91% of the performance of a hand-coded version.
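The wrap-around pattern that CSHIFT and modulo subscripts express boils down to a circular shift among ranks. The MPI/C++ sketch below (an illustration, not Maunam-generated code) shifts each rank's block to rank (r-1) mod P while receiving from (r+1) mod P — the basic step of Cannon-style and 2.5D matrix-multiply communication.

```cpp
#include <mpi.h>
#include <vector>

// Illustrative wrap-around (circular) shift: each of P ranks sends its
// block to rank (r-1) mod P and receives from (r+1) mod P, which is what
// a CSHIFT or a modulo subscript expresses at the language level.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, P = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    std::vector<double> block(1024, static_cast<double>(rank));   // local tile

    int dest = (rank - 1 + P) % P;    // modulo subscripts give the wrap-around
    int src  = (rank + 1) % P;
    MPI_Sendrecv_replace(block.data(), static_cast<int>(block.size()), MPI_DOUBLE,
                         dest, /*sendtag=*/0, src, /*recvtag=*/0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```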
{"title":"Communication Avoiding Algorithms: Analysis and Code Generation for Parallel Systems","authors":"K. Murthy, J. Mellor-Crummey","doi":"10.1109/PACT.2015.41","DOIUrl":"https://doi.org/10.1109/PACT.2015.41","url":null,"abstract":"Data movement is a critical bottleneck for future generations of parallel systems. The class of .5D communication-avoiding algorithms were developed to address this bottleneck. These algorithms reduce communication and provide strong scaling in both time and energy. As a firststep towards automating the development of communication-avoiding-libraries, we developed the Maunam compiler. Maunam generates efficient parallel code from a high-level, global view sketch of .5D algorithms that are expressed using symbolic data sizes and numbers of processors. It supports the expression of data movement and communication through-high-level global operations such as TILT and CSHIFT as well as through element-wise copy operations. With the latter, wrap around communication patterns can also be achieved using subscripts based on modulo operations. Maunam employs polyhedral analysis to reason about communication and computation present in a .5D algorithm. After partitioning data and computation, it inserts point-to-point-and collective communication as needed. Maunam also analyzes data dependence patterns and data layouts to identify reductions over processor subsets. Maunam-generated Fortran+MPI code for 2.5D matrix multiplication running on 4096 cores of a Cray XC30 super computer achieves 59 TFlops/s (76% of the machine peak). Our generated parallel code achieves 91% of the performance of a hand-coded version.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132323424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Dealing with the Unknown: Resilience to Prediction Errors
S. Mitra, G. Bronevetsky, Suhas Javagal, S. Bagchi
Accurate prediction of applications' performance and functional behavior is a critical component for a wide range of tools, including anomaly detection, task scheduling and approximate computing. Statistical modeling is a very powerful approach for making such predictions and it uses observations of application behavior on a small number of training cases to predict how the application will behave in practice. However, the fact that applications' behavior often depends closely on their configuration parameters and properties of their inputs means that any suite of application training runs will cover only a small fraction of its overall behavior space. Since a model's accuracy often degrades as application configuration and inputs deviate further from its training set, this makes it difficult to act based on the model's predictions. This paper presents a systematic approach to quantify the prediction errors of the statistical models of the application behavior, focusing on extrapolation, where the application configuration and input parameters differ significantly from the model's training set. Given any statistical model of application behavior and a data set of training application runs from which this model is built, our technique predicts the accuracy of the model for predicting application behavior on a new run on hitherto unseen inputs. We validate the utility of this method by evaluating it on the use case of anomaly detection for seven mainstream applications and benchmarks. The evaluation demonstrates that our technique can reduce false alarms while providing high detection accuracy compared to a statistical, input-unaware modeling technique.
{"title":"Dealing with the Unknown: Resilience to Prediction Errors","authors":"S. Mitra, G. Bronevetsky, Suhas Javagal, S. Bagchi","doi":"10.1109/PACT.2015.19","DOIUrl":"https://doi.org/10.1109/PACT.2015.19","url":null,"abstract":"Accurate prediction of applications' performance and functional behavior is a critical component for a widerange of tools, including anomaly detection, task scheduling and approximate computing. Statistical modeling is a very powerful approach for making such predictions and it uses observations of application behavior on a small number of training cases to predict how the application will behave in practice. However, the fact that applications' behavior often depends closely on their configuration parameters and properties of their inputs means that any suite of application training runs will cover only a small fraction of its overall behavior space. Since a model's accuracy often degrades as application configuration and inputs deviate further from its training set, this makes it difficult to act based on the model's predictions. This paper presents a systematic approach to quantify theprediction errors of the statistical models of the application behavior, focusing on extrapolation, where the application configuration and input parameters differ significantly from the model's training set. Given any statistical model of application behavior and a data set of training application runs from which this model is built, our technique predicts the accuracy of the model for predicting application behavior on a new run on hitherto unseen inputs. We validate the utility of this method by evaluating it on the use case of anomaly detection for seven mainstream applications and benchmarks. The evaluation demonstrates that our technique can reduce false alarms while providing high detection accuracy compared to a statistical, input-unaware modeling technique.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121513595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Storage Consolidation on SSDs: Not Always a Panacea, but Can We Ease the Pain?
Narges Shahidi, Anand Sivasubramanian, M. Kandemir, C. Das
Storage consolidation is increasingly being adopted to reduce system costs, simplify the storage infrastructure, and enhance availability and resource management. However, consolidation leads to interference in shared resources. This poster shows the effect of consolidation on the performance of co-located applications and proposes a static approach to reduce interference.
{"title":"Storage Consolidation on SSDs: Not Always a Panacea, but Can We Ease the Pain?","authors":"Narges Shahidi, Anand Sivasubramanian, M. Kandemir, C. Das","doi":"10.1109/PACT.2015.61","DOIUrl":"https://doi.org/10.1109/PACT.2015.61","url":null,"abstract":"Storage Consolidation is increasing being adopted to reduce system costs, simplify the storage infrastructure, and enhance availability and resource management. However, consolidation leads to interference in shared resources. This poster shows the effect of consolidation on performance of co-located applications and proposes a static approach to reduce interference.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114434935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Practical Near-Data Processing for In-Memory Analytics Frameworks
Mingyu Gao, Grant Ayers, C. Kozyrakis
The end of Dennard scaling has made all systems energy-constrained. For data-intensive applications with limited temporal locality, the major energy bottleneck is data movement between processor chips and main memory modules. For such workloads, the best way to optimize energy is to place processing near the data in main memory. Advances in 3D integration provide an opportunity to implement near-data processing (NDP) without the technology problems that similar efforts had in the past. This paper develops the hardware and software of an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks. We develop simple but scalable hardware support for coherence, communication, and synchronization, and a runtime system that is sufficient to support analytics frameworks with complex data patterns while hiding all the details of the NDP hardware. Our NDP architecture provides up to 16x performance and energy advantage over conventional approaches, and 2.5x over recently-proposed NDP systems. We also investigate the balance between processing and memory throughput, as well as the scalability and physical and logical organization of the memory system. Finally, we show that it is critical to optimize software frameworks for spatial locality as it leads to 2.9x efficiency improvements for NDP.
{"title":"Practical Near-Data Processing for In-Memory Analytics Frameworks","authors":"Mingyu Gao, Grant Ayers, C. Kozyrakis","doi":"10.1109/PACT.2015.22","DOIUrl":"https://doi.org/10.1109/PACT.2015.22","url":null,"abstract":"The end of Dennard scaling has made all systemsenergy-constrained. For data-intensive applications with limitedtemporal locality, the major energy bottleneck is data movementbetween processor chips and main memory modules. For such workloads, the best way to optimize energy is to place processing near the datain main memory. Advances in 3D integrationprovide an opportunity to implement near-data processing (NDP) withoutthe technology problems that similar efforts had in the past. This paper develops the hardware and software of an NDP architecturefor in-memory analytics frameworks, including MapReduce, graphprocessing, and deep neural networks. We develop simple but scalablehardware support for coherence, communication, and synchronization, anda runtime system that is sufficient to support analytics frameworks withcomplex data patterns while hiding all thedetails of the NDP hardware. Our NDP architecture provides up to 16x performance and energy advantageover conventional approaches, and 2.5x over recently-proposed NDP systems. We also investigate the balance between processing and memory throughput, as well as the scalability and physical and logical organization of the memory system. Finally, we show that it is critical to optimize software frameworksfor spatial locality as it leads to 2.9x efficiency improvements for NDP.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116964144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 237