Towards Communication Profile, Topology and Node Failure Aware Process Placement
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00041
Ioannis Vardas, Manolis Ploumidis, M. Marazakis
HPC systems need to keep growing in size to meet the ever-increasing demand for high levels of capability and capacity, often in tight time windows for urgent computation. However, increasing the size, complexity and heterogeneity of HPC systems also increases the risk and impact of system failures, which result in resource waste and aborted jobs. A major contributor to job completion time is the cost of interprocess communication. To address performance and energy efficiency, several prior studies have targeted improvements in communication locality: they derive a mapping of MPI processes to system nodes that reduces communication cost. However, such approaches disregard the effect of system failures. In this work, we propose a resource allocation approach for MPI jobs that considers both high performance and error resilience. Our approach, named Communication Profile, Topology and node Failure (CPTF), takes into account the application's communication profile, the system topology and the node failure probability when assigning job processes to nodes. We evaluate variants of CPTF through simulations of two MPI applications, one with a regular communication pattern (LAMMPS) and one with an irregular one (NPB-DT). In both cases, the variant of CPTF that strives to avoid failure-prone nodes and communication paths achieves lower time to complete job batches than the default resource allocation policy of Slurm, and it also exhibits the lowest ratio of aborted jobs. The average improvement in batch completion time is 67% for NPB-DT and 34% for LAMMPS.
{"title":"Towards Communication Profile, Topology and Node Failure Aware Process Placement","authors":"Ioannis Vardas, Manolis Ploumidis, M. Marazakis","doi":"10.1109/SBAC-PAD49847.2020.00041","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00041","url":null,"abstract":"HPC systems need to keep growing in size to meet the ever-increasing demand for high levels of capability and capacity, often in tight time windows for urgent computation. However, increasing the size, complexity and heterogeneity of HPC systems also increases the risk and impact of system failures, that result in resource waste and aborted jobs. A major contributor to job completion time is the cost of interprocess communication. To address performance and energy efficiency, several prior studies have targeted improvements of communication locality. To meet this goal, they derive a mapping of MPI processes to system nodes in a way that reduces communication cost. However, such approaches disregard the effect of system failures. In this work, we propose a resource allocation approach for MPI jobs, considering both high performance and error resilience. Our approach, named Communication Profile, Topology and node Failure (CPTF), takes into account the application's communication profile, system topology and node failure probability for assigning job processes to nodes. We evaluate variants of CPTF through simulations of two MPI applications, one with a regular communication pattern (LAMMPS) and one with an irregular one (NPB-DT). In both cases, the variant of CPTF that strives to avoid failure-prone nodes and communication paths achieves lower time to complete job batches when compared to the default resource allocation policy of Slurm. It also exhibits the lowest ratio of aborted jobs. The average improvement in batch completion time is 67% for NPB-DT and 34% for LAMMPS.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127903331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High Performance and Portable Convolution Operators for Multicore Processors
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00023
Pablo San Juan, Adrián Castelló, M. F. Dolz, P. Alonso-Jordá, E. S. Quintana‐Ortí
The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high-performance algorithms for the convolution operator present in this type of network. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear algebra libraries. The main problems of this approach are: 1) the large memory workspace required to host the intermediate matrices generated by the IM2COL transform; and 2) the time to perform the IM2COL transform, which is not negligible for complex neural networks. This paper presents a portable, high-performance convolution algorithm based on the BLIS realization of the GEMM kernel that avoids the intermediate memory workspace by taking advantage of the structure of BLIS. In addition, the proposed algorithm eliminates the cost of the explicit IM2COL transform, while maintaining the portability and performance of the underlying realization of GEMM in BLIS.
{"title":"High Performance and Portable Convolution Operators for Multicore Processors","authors":"Pablo San Juan, Adrián Castelló, M. F. Dolz, P. Alonso-Jordá, E. S. Quintana‐Ortí","doi":"10.1109/SBAC-PAD49847.2020.00023","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00023","url":null,"abstract":"The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of networks. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear algebra libraries. The main problems of this approach are 1) the large memory workspace required to host the intermediate matrices generated by the IM2COL transform; and 2) the time to perform the IM2COL transform, which is not negligible for complex neural networks. This paper presents a portable high performance convolution algorithm based on the BLIS realization of the GEMM kernel that avoids the use of the intermediate memory by taking advantage of the BLIS structure. In addition, the proposed algorithm eliminates the cost of the explicit IM2COL transform, while maintaining the portability and performance of the underlying realization of GEMM in BLIS.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"2002 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131372439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Controlling Garbage Collection and Request Admission to Improve Performance of FaaS Applications
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00033
David Quaresma, Daniel Fireman, T. Pereira
Runtime environments like Java's JRE, .NET's CLR, and Ruby's MRI are popular choices for cloud-based applications, particularly in the Function as a Service (FaaS) serverless computing context. A critical component of these runtime environments is the garbage collector (GC). The GC frees developers from manual memory management, which eases development and avoids a class of bugs. These benefits, however, come with a performance cost: the GC either pauses the runtime execution or competes with the running program for computational resources. In this work, we evaluate a technique, the Garbage Collector Control Interceptor (GCI), that eliminates the negative performance impact of GC by controlling when collections execute and transparently shedding requests while they are in progress. We executed experiments simulating AWS Lambda's behavior and found that GCI is a viable solution. It benefited the user by improving response time by up to 10.86% at the 99.9th percentile and reducing cost by 7.22%, and it also helped the platform provider by improving resource utilization by 14.52%.
{"title":"Controlling Garbage Collection and Request Admission to Improve Performance of FaaS Applications","authors":"David Quaresma, Daniel Fireman, T. Pereira","doi":"10.1109/SBAC-PAD49847.2020.00033","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00033","url":null,"abstract":"Runtime environments like Java's JRE, .NET's CLR, and Ruby's MRI, are popular choices for cloud-based applications and particularly in the Function as a Service (FaaS) serverless computing context. A critical component of runtime environments of these languages is the garbage collector (GC). The GC frees developers from manual memory management, which could potentially ease development and avoid bugs. The benefits of using the GC come with a negative impact on performance; that impact happens because either the GC needs to pause the runtime execution or competes with the running program for computational resources. In this work, we evaluated the usage of a technique - Garbage Collector Control Interceptor (GCI) - that eliminates the negative impact of GC on performance by controlling GC executions and transparently shedding requests while the collections are happening. We executed experiments simulating AWS Lambda's behavior and found that GCI is a viable solution. It benefited the user by improving the response time up to 10.86% at 99.9th percentile and reducing cost by 7.22%, but it also helped the platform provider by improving resource utilization by 14.52%.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127683345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hardware Multiversioning for Fail-Operational Multithreaded Applications
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00014
Rico Amslinger, Christian Piatka, Florian Haas, Sebastian Weis, T. Ungerer, S. Altmeyer
Modern safety-critical embedded applications like autonomous driving need to be fail-operational. At the same time, high performance and low power consumption are demanded. A common way to achieve this is the use of heterogeneous multi-cores. When applied to such systems, prevalent fault tolerance mechanisms suffer from some disadvantages: Some (e.g. triple modular redundancy) require a substantial amount of duplication, resulting in high hardware costs and power consumption. Others (e.g. lockstep) require supplementary checkpointing mechanisms to recover from errors. Further approaches (e.g. software-based process-level redundancy) cannot handle the indeterminism introduced by multithreaded execution. This paper presents a novel approach for fail-operational systems using hardware transactional memory, which can also be used for embedded systems running heterogeneous multi-cores. Each thread is automatically split into transactions, which then execute redundantly. The hardware transactional memory is extended to support multiple versions, which allows the reproduction of atomic operations and recovery in case of an error. In our FPGA-based evaluation, we executed the PARSEC benchmark suite with fault tolerance on 12 cores.
{"title":"Hardware Multiversioning for Fail-Operational Multithreaded Applications","authors":"Rico Amslinger, Christian Piatka, Florian Haas, Sebastian Weis, T. Ungerer, S. Altmeyer","doi":"10.1109/SBAC-PAD49847.2020.00014","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00014","url":null,"abstract":"Modern safety-critical embedded applications like autonomous driving need to be fail-operational. At the same time, high performance and low power consumption are demanded. A common way to achieve this is the use of heterogeneous multi-cores. When applied to such systems, prevalent fault tolerance mechanisms suffer from some disadvantages: Some (e.g. triple modular redundancy) require a substantial amount of duplication, resulting in high hardware costs and power consumption. Others (e.g. lockstep) require supplementary checkpointing mechanisms to recover from errors. Further approaches (e.g. software-based process-level redundancy) cannot handle the indeterminism introduced by multithreaded execution. This paper presents a novel approach for fail-operational systems using hardware transactional memory, which can also be used for embedded systems running heterogeneous multi-cores. Each thread is automatically split into transactions, which then execute redundantly. The hardware transactional memory is extended to support multiple versions, which allows the reproduction of atomic operations and recovery in case of an error. In our FPGA-based evaluation, we executed the PARSEC benchmark suite with fault tolerance on 12 cores.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124541331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimized Transactional Data Structure Approach to Concurrency Control for In-Memory Databases
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00025
Christina L. Peterson, Amalee Wilson, P. Pirkelbauer, D. Dechev
The optimistic concurrency control (OCC) utilized by in-memory databases performs writes on thread-local copies and makes the writes visible upon passing validation. However, high contention workloads suffer from failure of the validation step due to non-semantic memory access conflicts, leading to frequent transaction aborts. In this work, we improve the commit rate of in-memory databases by replacing OCC and the underlying indexing of key-value entries in the Silo database with a lock-free transactional dictionary. To further optimize the transactional commit rate, we present transactional merging, a technique that relaxes the semantic conflict resolution of transactional data structures by merging conflicting operations to reduce aborts. Transactional merging guarantees strict serializability through a strategy that recovers the correct abstract state given that a transaction attempting to merge operations aborts. The experimental evaluation demonstrates that the lock-free transactional dictionary with transactional merging achieves an average speedup of 175% over OCC and the Masstree indexing used in the Silo database for write-dominated workloads on a non-uniform memory access system.
{"title":"Optimized Transactional Data Structure Approach to Concurrency Control for In-Memory Databases","authors":"Christina L. Peterson, Amalee Wilson, P. Pirkelbauer, D. Dechev","doi":"10.1109/SBAC-PAD49847.2020.00025","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00025","url":null,"abstract":"The optimistic concurrency control (OCC) utilized by in-memory databases performs writes on thread-local copies and makes the writes visible upon passing validation. However, high contention workloads suffer from failure of the validation step due to non-semantic memory access conflicts, leading to frequent transaction aborts. In this work, we improve the commit rate of in-memory databases by replacing OCC and the underlying indexing of key-value entries in the Silo database with a lock-free transactional dictionary. To further optimize the transactional commit rate, we present transactional merging, a technique that relaxes the semantic conflict resolution of transactional data structures by merging conflicting operations to reduce aborts. Transactional merging guarantees strict serializability through a strategy that recovers the correct abstract state given that a transaction attempting to merge operations aborts. The experimental evaluation demonstrates that the lock-free transactional dictionary with transactional merging achieves an average speedup of 175% over OCC and the Masstree indexing used in the Silo database for write-dominated workloads on a non-uniform memory access system.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122872461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Re-evaluation of Atomic Operations and Graph Coloring for Unstructured Finite Volume GPU Simulations
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00048
Xi Zhang, Xu Sun, Xiaohu Guo, Yunfei Du, Yutong Lu, Yang Liu
In general, race conditions can be resolved by introducing synchronisation or by breaking data dependencies. Atomic operations and graph coloring are the two typical approaches to avoiding race conditions, and graph coloring algorithms have generally been considered the winning approach in the literature due to their lock-free implementations. In this paper, we present GPU-accelerated algorithms for the unstructured cell-centered finite volume Computational Fluid Dynamics (CFD) software framework PHengLEI, which was originally developed for aerodynamics applications with arbitrary hybrid meshes. Overall, the newly developed GPU framework demonstrates up to a 4.8x speedup compared with 18 MPI tasks run on the latest Intel CPU node. Considerable effort was invested in optimizing the data dependencies that can lead to race conditions, which arise from the indirect addressing of unstructured meshes and the associated reduction operations. Carefully comparing our optimised graph coloring against atomic operations in a series of numerical tests with different mesh sizes, we find that atomic operations are more efficient than our optimised graph coloring in all of the test cases on the Nvidia Tesla V100 GPU. Specifically, for the summation operation, using atomicAdd is twice as fast as graph coloring; for the maximum operation, atomicMax achieves a speedup of 1.5 to 2 over graph coloring.
{"title":"Re-evaluation of Atomic Operations and Graph Coloring for Unstructured Finite Volume GPU Simulations","authors":"Xi Zhang, Xu Sun, Xiaohu Guo, Yunfei Du, Yutong Lu, Yang Liu","doi":"10.1109/SBAC-PAD49847.2020.00048","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00048","url":null,"abstract":"In general, race condition can be resolved by introducing synchronisations or breaking data dependencies. Atomic operations and graph coloring are the two typical approaches to avoid race condition. Graph coloring algorithms have been generally considered winning algorithms in the literature due to their lock free implementations. In this paper, we present the GPU-accelerated algorithms of the unstructured cell-centered finite volume Computational Fluid Dynamics (CFD) software framework named PHengLEI which was originally developed for aerodynamics applications with arbitrary hybrid meshes. Overall, the newly developed GPU framework demonstrate up to 4.8 speedup comparing with 18 MPI tasks run on the latest Intel CPU node. Furthermore, the enormous efforts have been invested to optimize data dependencies which could lead to race condition due to unstructured mesh indirect addressing and related reduction math operations. With careful comparison between our optimised graph coloring and atomic operations using a series of numerical tests with different mesh sizes, the results show that atomic operations are more efficient than our optimised graph coloring in all of the test cases on Nvidia Tesla GPU V100. Specifically, for the summation operation, using atomicAdd is twice as fast as graph coloring. For the maximum operation, a speedup of 1.5 to 2 is found for atomicMax vs. graph coloring.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131709108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending Heterogeneous Applications to Remote Co-processors with rOpenCL
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00049
Rui Alves, J. Rufino
In heterogeneous computing systems, general-purpose CPUs are coupled with co-processors of different architectures, like GPUs and FPGAs. Applications may take advantage of this heterogeneous device ensemble to accelerate execution. However, developing heterogeneous applications requires specific programming models, under which applications unfold into code components targeting different computing devices. OpenCL is one of the main programming models for heterogeneous applications, set apart from others by its openness, vendor independence and support for different co-processors. In the original OpenCL application model, a heterogeneous application starts on a certain host node and then resorts to the local co-processors attached to that host. Co-processors at other nodes networked with the host are therefore inaccessible and cannot be used to accelerate the application. rOpenCL (remote OpenCL) overcomes this limitation for a significant subset of the OpenCL 1.2 API, offering OpenCL applications transparent access to remote devices through a TCP/IP based network. This paper presents the architecture and the most relevant implementation details of rOpenCL, together with the results of a preliminary set of reference benchmarks. These demonstrate the stability of the current prototype and show that, in many scenarios, the network overhead is smaller than expected.
{"title":"Extending Heterogeneous Applications to Remote Co-processors with rOpenCL","authors":"Rui Alves, J. Rufino","doi":"10.1109/SBAC-PAD49847.2020.00049","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00049","url":null,"abstract":"In heterogeneous computing systems, general purpose CPUs are coupled with co-processors of different architectures, like GPUs and FPGAs. Applications may take advantage of this heterogeneous device ensemble to accelerate execution. However, developing heterogeneous applications requires specific programming models, under which applications unfold into code components targeting different computing devices. OpenCL is one of the main programming models for heterogeneous applications, set apart from others due to its openness, vendor independence and support for different co-processors. In the original OpenCL application model, a heterogeneous application starts in a certain host node, and then resorts to the local co-processors attached to that host. Therefore, co-processors at other nodes, networked with the host node, are inaccessible and cannot be used to accelerate the application. rOpenCL (remote OpenCL) overcomes this limitation for a significant set of the OpenCL 1.2 API, offering OpenCL applications transparent access to remote devices through a TPC/IP based network. This paper presents the architecture and the most relevant implementation details of rOpenCL, together with the results of a preliminary set of reference benchmarks. These prove the stability of the current prototype and show that, in many scenarios, the network overhead is smaller than expected.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134321468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Highly Efficient SGEMM Implementation using DMA on the Intel/Movidius Myriad-2
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00051
Suyash Bakshi, L. Johnsson
Reducing energy consumption and achieving high energy efficiency in computation have become top priorities in High Performance Computing. High energy efficiency generally requires high resource utilization, since the energy demand of any application on any architecture depends on active time. We show that, by using DMA, the 28nm CMOS Myriad-2 Vision Processing Unit can achieve 25 GFLOPs/W for FP32 matrix multiplication. Our main contributions are: (i) an analysis of the data transfer needs of inner- and outer-product formulations of matrix multiplication with respect to the Myriad-2 memory hierarchy; (ii) an efficient use of DMA for managing matrix block transfers between on-chip and main memory; and (iii) a detailed analysis of the effects of matrix block shapes and DRAM page faults on performance and energy efficiency.
{"title":"A Highly Efficient SGEMM Implementation using DMA on the Intel/Movidius Myriad-2","authors":"Suyash Bakshi, L. Johnsson","doi":"10.1109/SBAC-PAD49847.2020.00051","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00051","url":null,"abstract":"Reducing energy consumption and achieving high energy efficiency in computation has become the top priority in High Performance Computing. High energy efficiency generally requires high resource utilization since energy demand for any applications and architectures is dependent on active time. We show that by using DMA the 28nm CMOS node Myriad-2 Vision Processing Unit can achieve 25 GFLOPs/W for FP32 matrixmultiplication. Our main contributions are: (i) An analysis of data transfer needs for inner and outer-product formulations of matrix multiplication with respect to the Myriad-2 memory hierarchy, (ii) An efficient use of DMA for managing matrix block transfers between on-chip and main memory (iii) A detailed analysis of the effects of matrix block shapes and DRAM page faults on performance and energy efficiency.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"235 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132296683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
JAMPI: A C++ Parallel Programming Interface Allowing the Implementation of Custom and Generic Scheduling Mechanisms
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00045
D. D. Domenico, G. H. Cavalheiro
The widespread adoption of modern parallel architectures has brought many challenges in terms of programming. In response, many parallel programming tools aim to help the user exploit hardware resources effectively. As a new alternative, this paper introduces the design and implementation of JAMPI, a generic parallel programming interface developed in C++ and focused on code reuse, productivity and high-level abstraction for building parallel applications. JAMPI is fully integrated with its host programming language and offers, as its main feature, a complete dissociation of its programming interface from its scheduling mechanism. To manage and optimize parallel execution, the proposed interface allows the programmer to implement a custom scheduling heuristic for each portion of the application code. Besides describing the JAMPI model, we conducted preliminary experiments using applications written with the proposed framework. Results show that JAMPI reaches good performance on multicore platforms and adds no performance penalty over the sequential version of the benchmarks.
{"title":"JAMPI: A C++ Parallel Programming Interface Allowing the Implementation of Custom and Generic Scheduling Mechanisms","authors":"D. D. Domenico, G. H. Cavalheiro","doi":"10.1109/SBAC-PAD49847.2020.00045","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00045","url":null,"abstract":"The widespread of modern parallel architectures brought many challenges in terms of programming. In response, many parallel programming tools intend to aid the user in order to exploit hardware resources effectively. As a new alternative, this paper introduces the design and implementation of JAMPI, a generic parallel programming interface developed in C++ focused on code reuse, productivity and high-level abstraction to enable the construction of parallel applications. JAMPI is totally integrated with its host programming language and offers, as main feature, a fully disassociation of its programming interface from its scheduling mechanism. Aiming to manage and optimize the parallel execution, the proposed interface allows the programmer to implement a custom scheduling heuristic for each portion of the application code. Besides JAMPI model description, we proceeded some preliminary experiments using applications encoded with the proposed framework. Results showed that JAMPI can be used to reach performance on multicore platforms and does not add performance penalty over the sequential version of the benchmarks.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131772770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Computation and Data Placements in Edge Infrastructures through a Common Simulator
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00020
A. Silva, Clément Mommessin, P. Neyron, D. Trystram, Adwait Bauskar, A. Lèbre, Alexandre van Kempen, Yanik Ngoko, Yoann Ricordel
Scheduling computational jobs with data-set dependencies is an important challenge for edge computing infrastructures. Although several strategies have been proposed, they have been evaluated through ad-hoc simulator extensions that are, when available at all, usually not maintained. This is a critical problem because it prevents researchers from easily performing fair comparisons between different proposals. In this paper, we address this limitation by presenting a simulation engine dedicated to the evaluation and comparison of scheduling and data movement policies for edge computing use-cases. Built upon the Batsim/SimGrid toolkit, our tool includes an injector that allows the simulator to replay a series of events captured on real infrastructures. It also includes a controller that supervises storage entities and data transfers during the simulation, and a plug-in system that allows researchers to add new models to cope with the diversity of edge computing devices. We demonstrate the relevance of such a simulation toolkit by studying two scheduling strategies with four data movement policies on top of a simulated version of the Qarnot Computing platform, a production edge infrastructure based on smart heaters. We chose this use-case because it illustrates both the heterogeneity and the uncertainties of edge infrastructures. Our ultimate goal is to gather industry and academia around a common simulator, so that efforts made by one group can be reused by others.
{"title":"Evaluating Computation and Data Placements in Edge Infrastructures through a Common Simulator","authors":"A. Silva, Clément Mommessin, P. Neyron, D. Trystram, Adwait Bauskar, A. Lèbre, Alexandre van Kempen, Yanik Ngoko, Yoann Ricordel","doi":"10.1109/SBAC-PAD49847.2020.00020","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00020","url":null,"abstract":"Scheduling computational jobs with data-sets dependencies is an important challenge of edge computing infrastructures. Although several strategies have been proposed, they have been evaluated through ad-hoc simulator extensions that are, when available, usually not maintained. This is a critical problem because it prevents researchers to –easily– perform fair comparisons between different proposals. In this paper, we propose to address this limitation by presenting a simulation engine dedicated to the evaluation and comparison of scheduling and data movement policies for edge computing use-cases. Built upon the Batsim/SimGrid toolkit, our tool includes an injector that allows the simulator to replay a series of events captured in real infrastructures. It also includes a controller that supervises storage entities and data transfers during the simulation, and a plug-in system that allows researchers to add new models to cope with the diversity of edge computing devices. We demonstrate the relevance of such a simulation toolkit by studying two scheduling strategies with four data movement policies on top of a simulated version of the Qarnot Computing platform, a production edge infrastructure based on smart heaters. We chose this use-case as it illustrates the heterogeneity as well as the uncertainties of edge infrastructures. Our ultimate goal is to gather industry and academics around a common simulator so that efforts made by one group can be factorised by others.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133012236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}