Towards Communication Profile, Topology and Node Failure Aware Process Placement
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00041
Ioannis Vardas, Manolis Ploumidis, M. Marazakis
HPC systems need to keep growing in size to meet the ever-increasing demand for high levels of capability and capacity, often in tight time windows for urgent computation. However, increasing the size, complexity and heterogeneity of HPC systems also increases the risk and impact of system failures, which result in resource waste and aborted jobs. A major contributor to job completion time is the cost of interprocess communication. To address performance and energy efficiency, several prior studies have targeted improvements in communication locality: they derive a mapping of MPI processes to system nodes that reduces communication cost. However, such approaches disregard the effect of system failures. In this work, we propose a resource allocation approach for MPI jobs that considers both high performance and error resilience. Our approach, named Communication Profile, Topology and node Failure (CPTF), takes into account the application's communication profile, the system topology and the node failure probability when assigning job processes to nodes. We evaluate variants of CPTF through simulations of two MPI applications, one with a regular communication pattern (LAMMPS) and one with an irregular one (NPB-DT). In both cases, the variant of CPTF that strives to avoid failure-prone nodes and communication paths achieves lower time to complete job batches than the default resource allocation policy of Slurm, and it also exhibits the lowest ratio of aborted jobs. The average improvement in batch completion time is 67% for NPB-DT and 34% for LAMMPS.
{"title":"Towards Communication Profile, Topology and Node Failure Aware Process Placement","authors":"Ioannis Vardas, Manolis Ploumidis, M. Marazakis","doi":"10.1109/SBAC-PAD49847.2020.00041","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00041","url":null,"abstract":"HPC systems need to keep growing in size to meet the ever-increasing demand for high levels of capability and capacity, often in tight time windows for urgent computation. However, increasing the size, complexity and heterogeneity of HPC systems also increases the risk and impact of system failures, that result in resource waste and aborted jobs. A major contributor to job completion time is the cost of interprocess communication. To address performance and energy efficiency, several prior studies have targeted improvements of communication locality. To meet this goal, they derive a mapping of MPI processes to system nodes in a way that reduces communication cost. However, such approaches disregard the effect of system failures. In this work, we propose a resource allocation approach for MPI jobs, considering both high performance and error resilience. Our approach, named Communication Profile, Topology and node Failure (CPTF), takes into account the application's communication profile, system topology and node failure probability for assigning job processes to nodes. We evaluate variants of CPTF through simulations of two MPI applications, one with a regular communication pattern (LAMMPS) and one with an irregular one (NPB-DT). In both cases, the variant of CPTF that strives to avoid failure-prone nodes and communication paths achieves lower time to complete job batches when compared to the default resource allocation policy of Slurm. It also exhibits the lowest ratio of aborted jobs. The average improvement in batch completion time is 67% for NPB-DT and 34% for LAMMPS.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127903331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High Performance and Portable Convolution Operators for Multicore Processors
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00023
Pablo San Juan, Adrián Castelló, M. F. Dolz, P. Alonso-Jordá, E. S. Quintana‐Ortí
The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high-performance algorithms for the convolution operator present in this type of network. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear algebra libraries. The main problems of this approach are: 1) the large memory workspace required to host the intermediate matrices generated by the IM2COL transform; and 2) the time to perform the IM2COL transform, which is not negligible for complex neural networks. This paper presents a portable, high-performance convolution algorithm based on the BLIS realization of the GEMM kernel that avoids the intermediate memory workspace by taking advantage of the structure of BLIS. In addition, the proposed algorithm eliminates the cost of the explicit IM2COL transform, while maintaining the portability and performance of the underlying realization of GEMM in BLIS.
{"title":"High Performance and Portable Convolution Operators for Multicore Processors","authors":"Pablo San Juan, Adrián Castelló, M. F. Dolz, P. Alonso-Jordá, E. S. Quintana‐Ortí","doi":"10.1109/SBAC-PAD49847.2020.00023","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00023","url":null,"abstract":"The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of networks. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear algebra libraries. The main problems of this approach are 1) the large memory workspace required to host the intermediate matrices generated by the IM2COL transform; and 2) the time to perform the IM2COL transform, which is not negligible for complex neural networks. This paper presents a portable high performance convolution algorithm based on the BLIS realization of the GEMM kernel that avoids the use of the intermediate memory by taking advantage of the BLIS structure. In addition, the proposed algorithm eliminates the cost of the explicit IM2COL transform, while maintaining the portability and performance of the underlying realization of GEMM in BLIS.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"2002 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131372439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Controlling Garbage Collection and Request Admission to Improve Performance of FaaS Applications
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00033
David Quaresma, Daniel Fireman, T. Pereira
Runtime environments like Java's JRE, .NET's CLR, and Ruby's MRI are popular choices for cloud-based applications, particularly in the Function as a Service (FaaS) serverless computing context. A critical component of these runtime environments is the garbage collector (GC). The GC frees developers from manual memory management, which eases development and avoids a class of bugs. These benefits, however, come with a performance cost: the GC either pauses the runtime execution or competes with the running program for computational resources. In this work, we evaluate a technique, the Garbage Collector Control Interceptor (GCI), that eliminates the negative performance impact of GC by controlling when collections execute and transparently shedding requests while they are in progress. We executed experiments simulating AWS Lambda's behavior and found that GCI is a viable solution. It benefited the user by improving response time by up to 10.86% at the 99.9th percentile and reducing cost by 7.22%, and it also helped the platform provider by improving resource utilization by 14.52%.
{"title":"Controlling Garbage Collection and Request Admission to Improve Performance of FaaS Applications","authors":"David Quaresma, Daniel Fireman, T. Pereira","doi":"10.1109/SBAC-PAD49847.2020.00033","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00033","url":null,"abstract":"Runtime environments like Java's JRE, .NET's CLR, and Ruby's MRI, are popular choices for cloud-based applications and particularly in the Function as a Service (FaaS) serverless computing context. A critical component of runtime environments of these languages is the garbage collector (GC). The GC frees developers from manual memory management, which could potentially ease development and avoid bugs. The benefits of using the GC come with a negative impact on performance; that impact happens because either the GC needs to pause the runtime execution or competes with the running program for computational resources. In this work, we evaluated the usage of a technique - Garbage Collector Control Interceptor (GCI) - that eliminates the negative impact of GC on performance by controlling GC executions and transparently shedding requests while the collections are happening. We executed experiments simulating AWS Lambda's behavior and found that GCI is a viable solution. It benefited the user by improving the response time up to 10.86% at 99.9th percentile and reducing cost by 7.22%, but it also helped the platform provider by improving resource utilization by 14.52%.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127683345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hardware Multiversioning for Fail-Operational Multithreaded Applications
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00014
Rico Amslinger, Christian Piatka, Florian Haas, Sebastian Weis, T. Ungerer, S. Altmeyer
Modern safety-critical embedded applications like autonomous driving need to be fail-operational. At the same time, high performance and low power consumption are demanded. A common way to achieve this is the use of heterogeneous multi-cores. When applied to such systems, prevalent fault tolerance mechanisms suffer from some disadvantages: Some (e.g. triple modular redundancy) require a substantial amount of duplication, resulting in high hardware costs and power consumption. Others (e.g. lockstep) require supplementary checkpointing mechanisms to recover from errors. Further approaches (e.g. software-based process-level redundancy) cannot handle the indeterminism introduced by multithreaded execution. This paper presents a novel approach for fail-operational systems using hardware transactional memory, which can also be used for embedded systems running heterogeneous multi-cores. Each thread is automatically split into transactions, which then execute redundantly. The hardware transactional memory is extended to support multiple versions, which allows the reproduction of atomic operations and recovery in case of an error. In our FPGA-based evaluation, we executed the PARSEC benchmark suite with fault tolerance on 12 cores.
{"title":"Hardware Multiversioning for Fail-Operational Multithreaded Applications","authors":"Rico Amslinger, Christian Piatka, Florian Haas, Sebastian Weis, T. Ungerer, S. Altmeyer","doi":"10.1109/SBAC-PAD49847.2020.00014","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00014","url":null,"abstract":"Modern safety-critical embedded applications like autonomous driving need to be fail-operational. At the same time, high performance and low power consumption are demanded. A common way to achieve this is the use of heterogeneous multi-cores. When applied to such systems, prevalent fault tolerance mechanisms suffer from some disadvantages: Some (e.g. triple modular redundancy) require a substantial amount of duplication, resulting in high hardware costs and power consumption. Others (e.g. lockstep) require supplementary checkpointing mechanisms to recover from errors. Further approaches (e.g. software-based process-level redundancy) cannot handle the indeterminism introduced by multithreaded execution. This paper presents a novel approach for fail-operational systems using hardware transactional memory, which can also be used for embedded systems running heterogeneous multi-cores. Each thread is automatically split into transactions, which then execute redundantly. The hardware transactional memory is extended to support multiple versions, which allows the reproduction of atomic operations and recovery in case of an error. In our FPGA-based evaluation, we executed the PARSEC benchmark suite with fault tolerance on 12 cores.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124541331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimized Transactional Data Structure Approach to Concurrency Control for In-Memory Databases
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00025
Christina L. Peterson, Amalee Wilson, P. Pirkelbauer, D. Dechev
The optimistic concurrency control (OCC) utilized by in-memory databases performs writes on thread-local copies and makes the writes visible upon passing validation. However, high contention workloads suffer from failure of the validation step due to non-semantic memory access conflicts, leading to frequent transaction aborts. In this work, we improve the commit rate of in-memory databases by replacing OCC and the underlying indexing of key-value entries in the Silo database with a lock-free transactional dictionary. To further optimize the transactional commit rate, we present transactional merging, a technique that relaxes the semantic conflict resolution of transactional data structures by merging conflicting operations to reduce aborts. Transactional merging guarantees strict serializability through a strategy that recovers the correct abstract state given that a transaction attempting to merge operations aborts. The experimental evaluation demonstrates that the lock-free transactional dictionary with transactional merging achieves an average speedup of 175% over OCC and the Masstree indexing used in the Silo database for write-dominated workloads on a non-uniform memory access system.
{"title":"Optimized Transactional Data Structure Approach to Concurrency Control for In-Memory Databases","authors":"Christina L. Peterson, Amalee Wilson, P. Pirkelbauer, D. Dechev","doi":"10.1109/SBAC-PAD49847.2020.00025","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00025","url":null,"abstract":"The optimistic concurrency control (OCC) utilized by in-memory databases performs writes on thread-local copies and makes the writes visible upon passing validation. However, high contention workloads suffer from failure of the validation step due to non-semantic memory access conflicts, leading to frequent transaction aborts. In this work, we improve the commit rate of in-memory databases by replacing OCC and the underlying indexing of key-value entries in the Silo database with a lock-free transactional dictionary. To further optimize the transactional commit rate, we present transactional merging, a technique that relaxes the semantic conflict resolution of transactional data structures by merging conflicting operations to reduce aborts. Transactional merging guarantees strict serializability through a strategy that recovers the correct abstract state given that a transaction attempting to merge operations aborts. The experimental evaluation demonstrates that the lock-free transactional dictionary with transactional merging achieves an average speedup of 175% over OCC and the Masstree indexing used in the Silo database for write-dominated workloads on a non-uniform memory access system.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122872461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Re-evaluation of Atomic Operations and Graph Coloring for Unstructured Finite Volume GPU Simulations
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00048
Xi Zhang, Xu Sun, Xiaohu Guo, Yunfei Du, Yutong Lu, Yang Liu
In general, race conditions can be resolved by introducing synchronisation or by breaking data dependencies. Atomic operations and graph coloring are the two typical approaches to avoiding race conditions, and graph coloring algorithms have generally been considered the winning approach in the literature due to their lock-free implementations. In this paper, we present GPU-accelerated algorithms for the unstructured cell-centered finite volume Computational Fluid Dynamics (CFD) software framework PHengLEI, which was originally developed for aerodynamics applications with arbitrary hybrid meshes. Overall, the newly developed GPU framework demonstrates up to a 4.8x speedup compared with 18 MPI tasks run on the latest Intel CPU node. Considerable effort was invested in optimizing the data dependencies that can lead to race conditions, which arise from the indirect addressing of unstructured meshes and the associated reduction operations. Carefully comparing our optimised graph coloring against atomic operations in a series of numerical tests with different mesh sizes, we find that atomic operations are more efficient than our optimised graph coloring in all of the test cases on the Nvidia Tesla V100 GPU. Specifically, for the summation operation, using atomicAdd is twice as fast as graph coloring; for the maximum operation, atomicMax achieves a speedup of 1.5 to 2 over graph coloring.
{"title":"Re-evaluation of Atomic Operations and Graph Coloring for Unstructured Finite Volume GPU Simulations","authors":"Xi Zhang, Xu Sun, Xiaohu Guo, Yunfei Du, Yutong Lu, Yang Liu","doi":"10.1109/SBAC-PAD49847.2020.00048","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00048","url":null,"abstract":"In general, race condition can be resolved by introducing synchronisations or breaking data dependencies. Atomic operations and graph coloring are the two typical approaches to avoid race condition. Graph coloring algorithms have been generally considered winning algorithms in the literature due to their lock free implementations. In this paper, we present the GPU-accelerated algorithms of the unstructured cell-centered finite volume Computational Fluid Dynamics (CFD) software framework named PHengLEI which was originally developed for aerodynamics applications with arbitrary hybrid meshes. Overall, the newly developed GPU framework demonstrate up to 4.8 speedup comparing with 18 MPI tasks run on the latest Intel CPU node. Furthermore, the enormous efforts have been invested to optimize data dependencies which could lead to race condition due to unstructured mesh indirect addressing and related reduction math operations. With careful comparison between our optimised graph coloring and atomic operations using a series of numerical tests with different mesh sizes, the results show that atomic operations are more efficient than our optimised graph coloring in all of the test cases on Nvidia Tesla GPU V100. Specifically, for the summation operation, using atomicAdd is twice as fast as graph coloring. For the maximum operation, a speedup of 1.5 to 2 is found for atomicMax vs. graph coloring.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131709108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending Heterogeneous Applications to Remote Co-processors with rOpenCL
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00049
Rui Alves, J. Rufino
In heterogeneous computing systems, general-purpose CPUs are coupled with co-processors of different architectures, like GPUs and FPGAs. Applications may take advantage of this heterogeneous device ensemble to accelerate execution. However, developing heterogeneous applications requires specific programming models, under which applications unfold into code components targeting different computing devices. OpenCL is one of the main programming models for heterogeneous applications, set apart from others by its openness, vendor independence and support for different co-processors. In the original OpenCL application model, a heterogeneous application starts on a certain host node and then resorts to the local co-processors attached to that host. Co-processors at other nodes networked with the host are therefore inaccessible and cannot be used to accelerate the application. rOpenCL (remote OpenCL) overcomes this limitation for a significant subset of the OpenCL 1.2 API, offering OpenCL applications transparent access to remote devices through a TCP/IP based network. This paper presents the architecture and the most relevant implementation details of rOpenCL, together with the results of a preliminary set of reference benchmarks. These demonstrate the stability of the current prototype and show that, in many scenarios, the network overhead is smaller than expected.
{"title":"Extending Heterogeneous Applications to Remote Co-processors with rOpenCL","authors":"Rui Alves, J. Rufino","doi":"10.1109/SBAC-PAD49847.2020.00049","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00049","url":null,"abstract":"In heterogeneous computing systems, general purpose CPUs are coupled with co-processors of different architectures, like GPUs and FPGAs. Applications may take advantage of this heterogeneous device ensemble to accelerate execution. However, developing heterogeneous applications requires specific programming models, under which applications unfold into code components targeting different computing devices. OpenCL is one of the main programming models for heterogeneous applications, set apart from others due to its openness, vendor independence and support for different co-processors. In the original OpenCL application model, a heterogeneous application starts in a certain host node, and then resorts to the local co-processors attached to that host. Therefore, co-processors at other nodes, networked with the host node, are inaccessible and cannot be used to accelerate the application. rOpenCL (remote OpenCL) overcomes this limitation for a significant set of the OpenCL 1.2 API, offering OpenCL applications transparent access to remote devices through a TPC/IP based network. This paper presents the architecture and the most relevant implementation details of rOpenCL, together with the results of a preliminary set of reference benchmarks. These prove the stability of the current prototype and show that, in many scenarios, the network overhead is smaller than expected.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134321468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Highly Efficient SGEMM Implementation using DMA on the Intel/Movidius Myriad-2
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00051
Suyash Bakshi, L. Johnsson
Reducing energy consumption and achieving high energy efficiency in computation have become top priorities in High Performance Computing. High energy efficiency generally requires high resource utilization, since the energy demand of any application on any architecture depends on active time. We show that, by using DMA, the 28nm CMOS Myriad-2 Vision Processing Unit can achieve 25 GFLOPs/W for FP32 matrix multiplication. Our main contributions are: (i) an analysis of the data transfer needs of inner- and outer-product formulations of matrix multiplication with respect to the Myriad-2 memory hierarchy; (ii) an efficient use of DMA for managing matrix block transfers between on-chip and main memory; and (iii) a detailed analysis of the effects of matrix block shapes and DRAM page faults on performance and energy efficiency.
{"title":"A Highly Efficient SGEMM Implementation using DMA on the Intel/Movidius Myriad-2","authors":"Suyash Bakshi, L. Johnsson","doi":"10.1109/SBAC-PAD49847.2020.00051","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00051","url":null,"abstract":"Reducing energy consumption and achieving high energy efficiency in computation has become the top priority in High Performance Computing. High energy efficiency generally requires high resource utilization since energy demand for any applications and architectures is dependent on active time. We show that by using DMA the 28nm CMOS node Myriad-2 Vision Processing Unit can achieve 25 GFLOPs/W for FP32 matrixmultiplication. Our main contributions are: (i) An analysis of data transfer needs for inner and outer-product formulations of matrix multiplication with respect to the Myriad-2 memory hierarchy, (ii) An efficient use of DMA for managing matrix block transfers between on-chip and main memory (iii) A detailed analysis of the effects of matrix block shapes and DRAM page faults on performance and energy efficiency.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"235 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132296683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
JAMPI: A C++ Parallel Programming Interface Allowing the Implementation of Custom and Generic Scheduling Mechanisms
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00045
D. D. Domenico, G. H. Cavalheiro
The widespread adoption of modern parallel architectures has brought many challenges in terms of programming. In response, many parallel programming tools aim to help the user exploit hardware resources effectively. As a new alternative, this paper introduces the design and implementation of JAMPI, a generic parallel programming interface developed in C++ and focused on code reuse, productivity and high-level abstraction for building parallel applications. JAMPI is fully integrated with its host programming language and offers, as its main feature, a complete dissociation of its programming interface from its scheduling mechanism. To manage and optimize parallel execution, the proposed interface allows the programmer to implement a custom scheduling heuristic for each portion of the application code. Besides describing the JAMPI model, we conducted preliminary experiments using applications written with the proposed framework. Results show that JAMPI reaches good performance on multicore platforms and adds no performance penalty over the sequential version of the benchmarks.
{"title":"JAMPI: A C++ Parallel Programming Interface Allowing the Implementation of Custom and Generic Scheduling Mechanisms","authors":"D. D. Domenico, G. H. Cavalheiro","doi":"10.1109/SBAC-PAD49847.2020.00045","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00045","url":null,"abstract":"The widespread of modern parallel architectures brought many challenges in terms of programming. In response, many parallel programming tools intend to aid the user in order to exploit hardware resources effectively. As a new alternative, this paper introduces the design and implementation of JAMPI, a generic parallel programming interface developed in C++ focused on code reuse, productivity and high-level abstraction to enable the construction of parallel applications. JAMPI is totally integrated with its host programming language and offers, as main feature, a fully disassociation of its programming interface from its scheduling mechanism. Aiming to manage and optimize the parallel execution, the proposed interface allows the programmer to implement a custom scheduling heuristic for each portion of the application code. Besides JAMPI model description, we proceeded some preliminary experiments using applications encoded with the proposed framework. Results showed that JAMPI can be used to reach performance on multicore platforms and does not add performance penalty over the sequential version of the benchmarks.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131772770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Computation and Data Placements in Edge Infrastructures through a Common Simulator
Pub Date: 2020-09-01 | DOI: 10.1109/SBAC-PAD49847.2020.00020
A. Silva, Clément Mommessin, P. Neyron, D. Trystram, Adwait Bauskar, A. Lèbre, Alexandre van Kempen, Yanik Ngoko, Yoann Ricordel
Scheduling computational jobs with data-set dependencies is an important challenge for edge computing infrastructures. Although several strategies have been proposed, they have been evaluated through ad-hoc simulator extensions that are, when available at all, usually not maintained. This is a critical problem because it prevents researchers from easily performing fair comparisons between different proposals. In this paper, we address this limitation by presenting a simulation engine dedicated to the evaluation and comparison of scheduling and data movement policies for edge computing use-cases. Built upon the Batsim/SimGrid toolkit, our tool includes an injector that allows the simulator to replay a series of events captured on real infrastructures. It also includes a controller that supervises storage entities and data transfers during the simulation, and a plug-in system that allows researchers to add new models to cope with the diversity of edge computing devices. We demonstrate the relevance of such a simulation toolkit by studying two scheduling strategies with four data movement policies on top of a simulated version of the Qarnot Computing platform, a production edge infrastructure based on smart heaters. We chose this use-case because it illustrates both the heterogeneity and the uncertainties of edge infrastructures. Our ultimate goal is to gather industry and academia around a common simulator, so that efforts made by one group can be reused by others.
{"title":"Evaluating Computation and Data Placements in Edge Infrastructures through a Common Simulator","authors":"A. Silva, Clément Mommessin, P. Neyron, D. Trystram, Adwait Bauskar, A. Lèbre, Alexandre van Kempen, Yanik Ngoko, Yoann Ricordel","doi":"10.1109/SBAC-PAD49847.2020.00020","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00020","url":null,"abstract":"Scheduling computational jobs with data-sets dependencies is an important challenge of edge computing infrastructures. Although several strategies have been proposed, they have been evaluated through ad-hoc simulator extensions that are, when available, usually not maintained. This is a critical problem because it prevents researchers to –easily– perform fair comparisons between different proposals. In this paper, we propose to address this limitation by presenting a simulation engine dedicated to the evaluation and comparison of scheduling and data movement policies for edge computing use-cases. Built upon the Batsim/SimGrid toolkit, our tool includes an injector that allows the simulator to replay a series of events captured in real infrastructures. It also includes a controller that supervises storage entities and data transfers during the simulation, and a plug-in system that allows researchers to add new models to cope with the diversity of edge computing devices. We demonstrate the relevance of such a simulation toolkit by studying two scheduling strategies with four data movement policies on top of a simulated version of the Qarnot Computing platform, a production edge infrastructure based on smart heaters. We chose this use-case as it illustrates the heterogeneity as well as the uncertainties of edge infrastructures. Our ultimate goal is to gather industry and academics around a common simulator so that efforts made by one group can be factorised by others.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133012236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}