A User-Level Scheduling Framework for BoT Applications on Private Clouds
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.18
Maicon Anca dos Santos, A. R. D. Bois, G. H. Cavalheiro
This paper presents a high-level model to describe bag-of-tasks (BoT) applications and a framework to evaluate user-level approaches to scheduling BoTs as coarser work units. The scheduler consolidates the load of the tasks onto a given number of virtual machines (VMs) and provides the estimated makespan. The framework allows the task-selection policy to be changed in order to compare the lengths of the schedules produced given a limited number of VMs. The framework takes a BoT description as input and produces, for each VM, its trace of processing load. This paper validates the BoT model and the proposed framework with a performance assessment. In our case studies, the output of the framework is submitted to a real OpenStack-based IaaS infrastructure. The results show that the makespan can be reduced by grouping tasks into coarser units of load.
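As an illustration of the kind of consolidation step the framework performs, the sketch below packs a bag of task loads onto a fixed number of VMs with a simple longest-task-first policy and reports the resulting makespan. The policy, the task representation (a plain load estimate per task) and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import heapq

def consolidate_bot(task_loads, num_vms):
    """Greedily pack task loads onto a fixed number of VMs (longest task first).

    Returns the per-VM load traces and the estimated makespan. This is only an
    illustrative policy; the paper's framework lets the selection policy be swapped.
    """
    # Min-heap of (accumulated_load, vm_index) so the least-loaded VM is picked next.
    heap = [(0.0, vm) for vm in range(num_vms)]
    heapq.heapify(heap)
    traces = [[] for _ in range(num_vms)]

    for load in sorted(task_loads, reverse=True):
        vm_load, vm = heapq.heappop(heap)
        traces[vm].append(load)
        heapq.heappush(heap, (vm_load + load, vm))

    makespan = max(sum(trace) for trace in traces)
    return traces, makespan

# Example: 8 tasks consolidated onto 3 VMs.
traces, makespan = consolidate_bot([5, 3, 8, 2, 7, 4, 6, 1], num_vms=3)
print(traces, makespan)
```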
SEDEA: A Sensible Approach to Account DRAM Energy in Multicore Systems
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.17
Qixiao Liu, Miquel Moretó, J. Abella, F. Cazorla, M. Valero
As the energy cost of today's computing systems keeps increasing, measuring energy becomes crucial in many scenarios. For instance, because the operational cost of datacenters largely depends on the energy consumed by the applications they execute, end users should be charged for the energy consumed, which requires a fair and consistent energy-measuring approach. However, multicore systems complicate per-task energy measurement, as the increased thread-level parallelism (TLP) allows several tasks to run simultaneously while sharing resources. The energy usage of each task is therefore hard to determine due to interleaved activities and mutual interference. To this end, Per-Task Energy Metering (PTEM) has been proposed to measure the actual energy of each task based on its resource utilization in a workload. However, the measured energy depends on interference from co-running tasks sharing the resources, and thus fails to provide consistency across executions. Therefore, Sensible Energy Accounting (SEA) has been proposed to deliver an abstraction of the energy consumption based on a particular allocation of resources to a task. In this work we provide a realization of SEA for the DRAM memory system, SEDEA, in which a task is accounted for the DRAM energy it would have consumed when running in isolation with a fraction of the on-chip shared cache. SEDEA is a mechanism to sensibly account for the DRAM energy of a task based on predicting its memory behavior. Our results show that SEDEA provides accurate estimates at low cost, beating existing per-task energy models, which do not target energy accounting in multicore systems. We also provide a use case showing that SEDEA can be used to guide shared-cache and memory-bank partitioning schemes to save energy.
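A minimal sketch of the accounting idea, assuming a task is charged for the dynamic energy of the DRAM accesses it would issue when running in isolation plus an even share of background power. The parameter names, the even split and the per-access energy model are illustrative assumptions, not SEDEA's actual mechanism.

```python
def account_dram_energy(pred_isolated_accesses, energy_per_access_nj,
                        background_power_w, isolated_runtime_s, num_tasks):
    """Toy per-task DRAM energy account in the spirit of SEA/SEDEA: charge the
    task for the accesses it is predicted to issue in isolation, plus an even
    share of the background (static/refresh) power. Illustrative only."""
    dynamic_j = pred_isolated_accesses * energy_per_access_nj * 1e-9
    background_j = background_power_w * isolated_runtime_s / num_tasks
    return dynamic_j + background_j

# Example: 2e9 predicted accesses at 15 nJ each, 1 W background, 10 s, 4 tasks.
print(account_dram_energy(2e9, 15, 1.0, 10.0, 4))
```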
Global Snapshot of a Distributed System Running on Virtual Machines
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.29
Carlos E. Gómez, Harold E. Castro, Carlos A. Varela
Recently, a new concept called the desktop cloud emerged, developed to offer cloud computing services on non-dedicated resources. Like cloud computing, desktop clouds are based on virtualization and, like other computational systems, may experience faults at any time. As a consequence, reliability has become a concern for researchers. Fault-tolerance strategies focused on independent virtual machines include snapshots (checkpoints) to resume execution from a healthy state of a virtual machine on the same or another host, which is trivial because hypervisors provide this function. However, it is not trivial to obtain a global snapshot of a distributed system formed by applications that communicate among themselves, because no global clock exists, so it cannot be guaranteed that the snapshots of all VMs are taken at the same time. Therefore, a protocol is needed to coordinate the participants in order to obtain a global snapshot. In this paper, we propose a global snapshot protocol called UnaCloud Snapshot for application in the context of desktop clouds over TCP/IP networks. This differs from other proposals that use a virtual network to inspect and manipulate the traffic circulating among virtual machines, which makes them difficult to apply in more realistic environments. We obtain a consistent global snapshot of a general distributed system running on virtual machines that maintains the semantics of the system without modifying the applications running on the virtual machines or the hypervisors. A first prototype was developed, and the preliminary results of our evaluation are presented.
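As background on why coordination is needed, the sketch below shows a classical marker-based snapshot participant in the Chandy-Lamport style: local state is recorded on the first marker, and in-flight messages are recorded per channel until a marker arrives on every incoming channel. This is textbook background, not the UnaCloud Snapshot protocol itself, whose details are not given in the abstract.

```python
class SnapshotParticipant:
    """Marker-based global-snapshot participant (Chandy-Lamport style).
    Background illustration only -- not the UnaCloud Snapshot protocol."""

    def __init__(self, pid, incoming_channels):
        self.pid = pid
        self.local_state = None
        self.recorded = {c: [] for c in incoming_channels}      # in-flight messages
        self.marker_seen = {c: False for c in incoming_channels}
        self.snapshot_started = False

    def start_snapshot(self, current_state, broadcast_marker):
        self.local_state = current_state
        self.snapshot_started = True
        broadcast_marker(self.pid)        # marker goes out on every outgoing channel

    def on_message(self, channel, msg, current_state, broadcast_marker):
        if msg == "MARKER":
            if not self.snapshot_started:
                self.start_snapshot(current_state, broadcast_marker)
            self.marker_seen[channel] = True
        elif self.snapshot_started and not self.marker_seen[channel]:
            self.recorded[channel].append(msg)   # message was in flight at snapshot time

    def complete(self):
        return self.snapshot_started and all(self.marker_seen.values())
```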
Towards a Lock-Free, Fixed Size and Persistent Hash Map Design
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.26
M. Areias, Ricardo Rocha
Hash tries are a trie-based data structure with nearly ideal characteristics for the implementation of hash maps. In this paper, we present a novel, simple and scalable hash trie map design that fully supports concurrent search, insert and remove operations on hash maps. To the best of our knowledge, our proposal is the first concurrent hash map design that combines the following characteristics: (i) it is lock-free; (ii) it uses fixed-size data structures; and (iii) it maintains access to all internal data structures as persistent memory references. Experimental results show that our proposal is quite competitive when compared against other state-of-the-art proposals implemented in Java. Its design is modular enough to allow different configurations aimed at different trade-offs between memory usage and execution time.
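A single-threaded sketch of a hash trie with fixed-size nodes, to illustrate the underlying layout (each level consumes a fixed chunk of the key's hash). The paper's lock-free, CAS-based concurrency and persistent memory references are not reproduced here; node size, helper names and the no-full-hash-collision assumption are illustrative.

```python
class HashNode:
    """Fixed-size hash trie node: each bucket holds None, a (key, value) entry,
    or a child HashNode covering the next chunk of the hash."""
    SIZE = 8    # buckets per level (illustrative)
    BITS = 3    # log2(SIZE)

    def __init__(self):
        self.buckets = [None] * self.SIZE

def _index(key, level):
    return (hash(key) >> (level * HashNode.BITS)) & (HashNode.SIZE - 1)

def trie_insert(node, key, value, level=0):
    # Assumes no two distinct keys share the exact same full hash value.
    i = _index(key, level)
    slot = node.buckets[i]
    if slot is None:
        node.buckets[i] = (key, value)
    elif isinstance(slot, HashNode):
        trie_insert(slot, key, value, level + 1)
    elif slot[0] == key:
        node.buckets[i] = (key, value)            # update existing key
    else:                                         # collision: expand into a child node
        child = HashNode()
        node.buckets[i] = child
        trie_insert(child, slot[0], slot[1], level + 1)
        trie_insert(child, key, value, level + 1)

def trie_lookup(node, key, level=0):
    slot = node.buckets[_index(key, level)]
    if isinstance(slot, HashNode):
        return trie_lookup(slot, key, level + 1)
    if slot is not None and slot[0] == key:
        return slot[1]
    return None

root = HashNode()
for k in ("alpha", "beta", "gamma"):
    trie_insert(root, k, len(k))
print(trie_lookup(root, "beta"))   # -> 4
```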
A Machine Learning Approach for Performance Prediction and Scheduling on Heterogeneous CPUs
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.23
Daniel Nemirovsky, Tugberk Arkose, Nikola Marković, M. Nemirovsky, O. Unsal, A. Cristal
As heterogeneous systems become more ubiquitous, computer architects will need to develop novel CPU scheduling techniques capable of exploiting the diversity of computational resources. Accurately estimating the performance of applications on different heterogeneous resources can provide a significant advantage to heterogeneous schedulers seeking to improve system performance. Recent advances in machine learning, including artificial neural network models, have led to the development of powerful and practical prediction models for a variety of fields. As of yet, however, no significant leaps have been taken towards employing machine learning for heterogeneous scheduling in order to maximize system throughput. In this paper we propose a throughput-maximizing heterogeneous CPU scheduling model that uses machine learning to predict the performance of multiple threads on diverse system resources at the scheduling-quantum granularity. We demonstrate how lightweight artificial neural networks (ANNs) can provide highly accurate performance predictions for a diverse set of applications, thereby helping to improve heterogeneous scheduling efficiency. We show that online training is capable of increasing prediction accuracy, but that deepening the ANNs can result in diminishing returns. Notably, our approach yields 25% to 31% throughput improvements over conventional heterogeneous schedulers for CPU- and memory-intensive applications.
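A toy sketch of the overall idea, assuming a small feed-forward network maps per-thread performance-counter features to a predicted IPC per core type, and a greedy pass assigns threads to core slots for one quantum. The feature set, network shape, core-type names and scheduling policy are illustrative assumptions, not the paper's model.

```python
import numpy as np

def predict_ipc(features, W1, b1, W2, b2):
    """Tiny feed-forward ANN: per-thread counter features -> predicted IPC.
    The weights stand in for a trained (or online-trained) model."""
    hidden = np.maximum(0.0, features @ W1 + b1)   # ReLU hidden layer
    return float(hidden @ W2 + b2)

def schedule_quantum(thread_features, core_slots, models):
    """Greedy throughput-maximizing assignment for one scheduling quantum:
    each thread is placed on the still-free core type with the highest
    predicted IPC. Assumes at least as many core slots as threads.
    core_slots is e.g. ["big", "big", "little", "little"]; models maps a
    core type to its ANN weights (W1, b1, W2, b2)."""
    free = list(core_slots)
    plan = {}
    for tid, feats in thread_features.items():
        best = max(free, key=lambda core: predict_ipc(feats, *models[core]))
        plan[tid] = best
        free.remove(best)
    return plan
```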
The Case for Flexible ISAs: Unleashing Hardware and Software
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.16
R. Auler, E. Borin
For a long time, the Instruction Set Architecture (ISA) has been the firm contract between software and hardware. This firm contract plays an important role by decoupling the development of software from hardware micro-architectural features, enabling both to evolve independently. Nonetheless, it also condemns the ISA to become larger, more cluttered and less efficient as new instructions are incorporated over the years and deprecated instructions are left untouched to preserve legacy compatibility. In this work we propose OpenISA, a flexible ISA that enables both the software and the hardware to evolve independently, and discuss how OpenISA 1.0 was designed to enable efficient emulation of OpenISA software on alien ISAs, which is key to freeing the user from hardware lock-in. Our results show that software compiled to OpenISA can later be emulated on x86 and ARM processors with very little overhead, achieving near-native performance, with overhead under 10% for the majority of programs.
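As a generic illustration of software emulation of a guest ISA on an alien host (not OpenISA's actual encoding, calling convention or translation machinery), the toy interpreter below executes a small register-machine program; the opcode names are made up for the example.

```python
def emulate(program, regs=None):
    """Toy register-machine interpreter. Real OpenISA emulation relies on far
    more efficient techniques to reach near-native performance."""
    regs = dict(regs or {})
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "li":                      # load immediate: li rd, imm
            regs[args[0]] = args[1]
        elif op == "add":                   # add rd, rs1, rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "beq":                   # branch to target if rs1 == rs2
            if regs[args[0]] == regs[args[1]]:
                pc = args[2] - 1
        elif op == "halt":
            break
        pc += 1
    return regs

print(emulate([("li", "r1", 2), ("li", "r2", 3), ("add", "r0", "r1", "r2"), ("halt",)]))
```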
Towards a Deterministic Fine-Grained Task Ordering Using Multi-Versioned Memory
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.21
Eran Gilad, Tehila Mayzels, Elazar Raab, M. Oskin, Yoav Etsion
Task-based programming models aim to simplify parallel programming. A runtime system schedules tasks to execute on cores, and an essential component of this runtime is tracking and managing dependencies between tasks. A typical approach is to rely on programmers to annotate tasks and data structures, essentially manually specifying the input and output of each task. As such, dependencies are associated with named program objects, making this approach problematic for pointer-based data structures. Furthermore, because the runtime system must track these dependencies, the read and write sets should be kept small for efficient runtime performance. We presume a memory system with architecturally visible support for multiple versions of data stored at the same program address. This paper proposes and evaluates a task-based execution model that uses this versioned memory system to deterministically parallelize sequential code. We have built a task-based runtime layer that uses this type of memory system for dependence tracking. We demonstrate the advantages of the proposed model by parallelizing pointer-heavy code, obtaining speedups of up to 19x on a 32-core system.
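A toy sketch of a multi-versioned memory location, assuming tasks are numbered in program order (starting at 1) and a read returns the value written by the latest earlier task. It illustrates only the version-lookup rule; the runtime's dependence tracking, ordering enforcement and the architectural support the paper presumes are not modeled.

```python
import bisect

class VersionedCell:
    """Toy multi-versioned memory location: each write creates a new version
    tagged with the writer's sequential task index; a read sees the value of
    the latest preceding writer. Illustrative only."""

    def __init__(self, initial):
        self.idxs = [0]        # writer task indices, kept sorted (0 = initial value)
        self.vals = [initial]

    def write(self, task_idx, value):
        pos = bisect.bisect_left(self.idxs, task_idx)
        self.idxs.insert(pos, task_idx)
        self.vals.insert(pos, value)

    def read(self, task_idx):
        # Return the version written by the latest task strictly before task_idx.
        pos = bisect.bisect_left(self.idxs, task_idx)
        return self.vals[pos - 1]

x = VersionedCell(0)
x.write(1, 10)                 # task 1 writes 10
x.write(3, 30)                 # task 3 writes 30
print(x.read(2), x.read(4))    # task 2 sees 10, task 4 sees 30
```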
Data Coherence Analysis and Optimization for Heterogeneous Computing
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.9
R. Sousa, M. Pereira, Fernando Magno Quintão Pereira, G. Araújo
Although heterogeneous computing has enabled impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap the full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g., CUDA, OpenCL). The cost of moving and keeping host/device data coherent may easily eliminate any performance gains achieved by acceleration. Although this problem has been extensively studied for multicore architectures and was recently tackled in discrete GPUs through CUDA 8, no generic solution exists for integrated CPU/GPU architectures like those found in mobile devices (e.g., ARM Mali). This paper proposes Data Coherence Analysis (DCA), a set of two data-flow analyses that determine how variables are used by the host and device at each program point. It also introduces Data Coherence Optimization (DCO), a code optimization technique that uses DCA information to: (a) allocate OpenCL shared buffers between the host and devices; and (b) insert appropriate OpenCL function calls at program points so as to minimize the number of data coherence operations. DCO was implemented in AClang LLVM (www.aclang.org), a compiler capable of translating OpenMP 4.X annotated loops to OpenCL kernels, thus hiding the complexity of programming directly in OpenCL. Experimental results using DCA and DCO in AClang to compile programs from the Parboil, Polybench and Rodinia benchmarks reveal speed-ups of up to 5.25x on an Exynos 8890 octa-core CPU with an ARM Mali-T880 MP12 GPU and up to 2.03x on a 2.4 GHz dual-core Intel Core i5 processor equipped with an Intel Iris GPU.
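A toy forward pass over a single buffer's host/device accesses, emitting a coherence operation only when ownership actually changes hands, to convey why tracking usage per program point lets redundant coherence operations be elided. The state and operation names are made up for the example (they are not the actual OpenCL calls DCO inserts), and the real DCA is a compile-time data-flow analysis rather than a runtime pass.

```python
# Coherence states for a shared buffer, in the spirit of DCA/DCO.
HOST, DEVICE, COHERENT = "host", "device", "coherent"

def coherence_ops(accesses, state=COHERENT):
    """Given a sequence of ('host'|'device', 'read'|'write') accesses to one
    buffer, emit only the coherence operations actually required."""
    ops = []
    for side, kind in accesses:
        if side == "host" and state == DEVICE:
            ops.append("sync_to_host")       # make device writes visible to the host
            state = COHERENT
        elif side == "device" and state == HOST:
            ops.append("sync_to_device")     # hand the buffer back to the device
            state = COHERENT
        if kind == "write":
            state = HOST if side == "host" else DEVICE
    return ops

print(coherence_ops([("host", "write"), ("device", "read"),
                     ("device", "write"), ("host", "read")]))
# -> ['sync_to_device', 'sync_to_host']
```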
Accelerating Graph Analytics on CPU-FPGA Heterogeneous Platform
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.25
Shijie Zhou, V. Prasanna
Hardware accelerators for graph analytics have gained increasing interest. Vertex-centric and edge-centric paradigms are widely used to design graph analytics accelerators. However, both have notable drawbacks: the vertex-centric paradigm requires random memory accesses to traverse edges, while the edge-centric paradigm results in redundant edge traversals. In this paper, we explore the tradeoffs between the vertex-centric and edge-centric paradigms and propose a hybrid algorithm which dynamically selects between them during execution. We introduce the notion of the active vertex ratio, based on which we develop a simple but efficient paradigm selection approach. We develop a hybrid data structure that concurrently supports the vertex-centric and edge-centric paradigms. Based on this hybrid data structure, we propose a graph partitioning scheme to increase parallelism and enable efficient parallel computation on heterogeneous platforms. In each iteration, we use our paradigm selection approach to select the appropriate paradigm for each partition. Further, we map our hybrid algorithm onto a state-of-the-art heterogeneous platform which integrates a multi-core CPU and a Field-Programmable Gate Array (FPGA) in a cache-coherent fashion. We use our design methodology to accelerate two fundamental graph algorithms, breadth-first search (BFS) and single-source shortest path (SSSP). Experimental results show that our CPU-FPGA co-processing achieves up to 1.5× (1.9×) speedup for BFS (SSSP) compared with optimized baseline designs. Compared with state-of-the-art FPGA-based designs, our design achieves up to 4.0× (4.2×) throughput improvement for BFS (SSSP). Compared with a state-of-the-art multi-core design, our design demonstrates up to 1.5× (1.8×) speedup for BFS (SSSP).
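A sketch of the selection idea for BFS, assuming the active vertex ratio is compared against a fixed threshold each iteration to choose between a frontier-driven (vertex-centric) pass and a full edge-stream (edge-centric) pass. The threshold value, the in-memory data layout and the single-partition setting are illustrative; the CPU-FPGA mapping and the paper's hybrid data structure are not modeled.

```python
def hybrid_bfs(num_vertices, adj, edges, source, threshold=0.05):
    """BFS that switches per iteration between a vertex-centric pass (traverse
    only the frontier's outgoing edges) and an edge-centric pass (stream the
    whole edge list) based on the active vertex ratio."""
    dist = [None] * num_vertices
    dist[source] = 0
    frontier, level = {source}, 0
    while frontier:
        nxt = set()
        if len(frontier) / num_vertices < threshold:
            # Vertex-centric: random accesses, but only over the frontier's edges.
            for u in frontier:
                for v in adj[u]:
                    if dist[v] is None:
                        dist[v] = level + 1
                        nxt.add(v)
        else:
            # Edge-centric: sequential streaming of the entire edge list.
            for u, v in edges:
                if u in frontier and dist[v] is None:
                    dist[v] = level + 1
                    nxt.add(v)
        frontier = nxt
        level += 1
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(hybrid_bfs(4, adj, edges, source=0, threshold=0.5))   # -> [0, 1, 1, 2]
```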
Overcoming Memory-Capacity Constraints in the Use of ILUPACK on Graphics Processors
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.13
J. Aliaga, Ernesto Dufrechu, P. Ezzatti, E. S. Quintana‐Ortí
A significant number of scientific and engineering problems currently require the solution of large, sparse linear systems of equations. In previous work, we applied a GPU accelerator to the solution of sparse linear systems of moderate dimension via ILUPACK, showing important reductions in execution time while maintaining the quality of the solution. Unfortunately, using GPUs attached to a single compute node strongly limits the memory available to solve the systems, and thus the size of the problems that can be tackled with this approach. In this work we introduce a distributed-parallel version of ILUPACK that overcomes these limitations. The evaluation results show that using multiple GPUs, located on distinct nodes of a cluster, yields relevant reductions in execution time for large problems and, more importantly, allows the dimension of the problems to be increased, showing interesting scaling properties.
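A minimal sketch of why distributing the system relaxes the memory-capacity limit: an even row-block partition spreads the data over several devices, so each GPU holds only a fraction of the problem. ILUPACK's actual multilevel, distributed-parallel organization is far richer than this; the function below is purely illustrative.

```python
def partition_rows(n_rows, n_gpus):
    """Even row-block partition of a sparse system across GPUs/nodes,
    returning (start, end) row ranges; each device stores ~n_rows/n_gpus rows."""
    base, extra = divmod(n_rows, n_gpus)
    bounds, start = [], 0
    for g in range(n_gpus):
        size = base + (1 if g < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

print(partition_rows(10, 3))   # -> [(0, 4), (4, 7), (7, 10)]
```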