A User-Level Scheduling Framework for BoT Applications on Private Clouds
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.18
Maicon Anca dos Santos, A. R. D. Bois, G. H. Cavalheiro
This paper presents a high-level model to describe bag-of-tasks (BoT) applications and a framework to evaluate user-level approaches to scheduling BoTs as coarser work units. The scheduler consolidates the load of the tasks onto a given number of virtual machines (VMs) and provides the estimated makespan. The framework allows the task-selection policy to be changed in order to compare the lengths of the schedules produced given a limited number of VMs. The framework takes a BoT description as input and produces, for each VM, its trace of processing load. This paper validates the BoT model and the proposed framework with a performance assessment. In our case studies, the output of the framework is submitted to a real OpenStack-based IaaS infrastructure. The results show that the makespan can be reduced by grouping tasks into coarser units of load.
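As an illustration of the kind of consolidation step the framework performs, the sketch below packs a bag of task loads onto a fixed number of VMs with a simple longest-task-first policy and reports the resulting makespan. The policy, the task representation (a plain load estimate per task) and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import heapq

def consolidate_bot(task_loads, num_vms):
    """Greedily pack task loads onto a fixed number of VMs (longest task first).

    Returns the per-VM load traces and the estimated makespan. This is only an
    illustrative policy; the paper's framework lets the selection policy be swapped.
    """
    # Min-heap of (accumulated_load, vm_index) so the least-loaded VM is picked next.
    heap = [(0.0, vm) for vm in range(num_vms)]
    heapq.heapify(heap)
    traces = [[] for _ in range(num_vms)]

    for load in sorted(task_loads, reverse=True):
        vm_load, vm = heapq.heappop(heap)
        traces[vm].append(load)
        heapq.heappush(heap, (vm_load + load, vm))

    makespan = max(sum(trace) for trace in traces)
    return traces, makespan

# Example: 8 tasks consolidated onto 3 VMs.
traces, makespan = consolidate_bot([5, 3, 8, 2, 7, 4, 6, 1], num_vms=3)
print(traces, makespan)
```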
SEDEA: A Sensible Approach to Account DRAM Energy in Multicore Systems
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.17
Qixiao Liu, Miquel Moretó, J. Abella, F. Cazorla, M. Valero
As the energy cost of today's computing systems keeps increasing, measuring energy becomes crucial in many scenarios. For instance, because the operational cost of datacenters largely depends on the energy consumed by the applications they execute, end users should be charged for the energy consumed, which requires a fair and consistent energy-measuring approach. However, multicore systems complicate per-task energy measurement, as the increased thread-level parallelism (TLP) allows several tasks to run simultaneously while sharing resources. The energy usage of each task is therefore hard to determine due to interleaved activities and mutual interference. To this end, Per-Task Energy Metering (PTEM) has been proposed to measure the actual energy of each task based on its resource utilization in a workload. However, the measured energy depends on interference from co-running tasks sharing the resources, and thus fails to provide consistency across executions. Therefore, Sensible Energy Accounting (SEA) has been proposed to deliver an abstraction of the energy consumption based on a particular allocation of resources to a task. In this work we provide a realization of SEA for the DRAM memory system, SEDEA, in which a task is accounted for the DRAM energy it would have consumed when running in isolation with a fraction of the on-chip shared cache. SEDEA is a mechanism to sensibly account for the DRAM energy of a task based on predicting its memory behavior. Our results show that SEDEA provides accurate estimates at low cost, beating existing per-task energy models, which do not target energy accounting in multicore systems. We also provide a use case showing that SEDEA can be used to guide shared-cache and memory-bank partitioning schemes to save energy.
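A minimal sketch of the accounting idea, assuming a task is charged for the dynamic energy of the DRAM accesses it would issue when running in isolation plus an even share of background power. The parameter names, the even split and the per-access energy model are illustrative assumptions, not SEDEA's actual mechanism.

```python
def account_dram_energy(pred_isolated_accesses, energy_per_access_nj,
                        background_power_w, isolated_runtime_s, num_tasks):
    """Toy per-task DRAM energy account in the spirit of SEA/SEDEA: charge the
    task for the accesses it is predicted to issue in isolation, plus an even
    share of the background (static/refresh) power. Illustrative only."""
    dynamic_j = pred_isolated_accesses * energy_per_access_nj * 1e-9
    background_j = background_power_w * isolated_runtime_s / num_tasks
    return dynamic_j + background_j

# Example: 2e9 predicted accesses at 15 nJ each, 1 W background, 10 s, 4 tasks.
print(account_dram_energy(2e9, 15, 1.0, 10.0, 4))
```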
Global Snapshot of a Distributed System Running on Virtual Machines
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.29
Carlos E. Gómez, Harold E. Castro, Carlos A. Varela
Recently, a new concept called the desktop cloud emerged, developed to offer cloud computing services on non-dedicated resources. Like cloud computing, desktop clouds are based on virtualization and, like other computational systems, may experience faults at any time. As a consequence, reliability has become a concern for researchers. Fault-tolerance strategies focused on independent virtual machines include snapshots (checkpoints) to resume execution from a healthy state of a virtual machine on the same or another host, which is trivial because hypervisors provide this function. However, it is not trivial to obtain a global snapshot of a distributed system formed by applications that communicate among themselves, because no global clock exists, so it cannot be guaranteed that the snapshots of all VMs are taken at the same time. Therefore, a protocol is needed to coordinate the participants in order to obtain a global snapshot. In this paper, we propose a global snapshot protocol called UnaCloud Snapshot for application in the context of desktop clouds over TCP/IP networks. This differs from other proposals that use a virtual network to inspect and manipulate the traffic circulating among virtual machines, which makes them difficult to apply in more realistic environments. We obtain a consistent global snapshot of a general distributed system running on virtual machines that maintains the semantics of the system without modifying the applications running on the virtual machines or the hypervisors. A first prototype was developed, and the preliminary results of our evaluation are presented.
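As background on why coordination is needed, the sketch below shows a classical marker-based snapshot participant in the Chandy-Lamport style: local state is recorded on the first marker, and in-flight messages are recorded per channel until a marker arrives on every incoming channel. This is textbook background, not the UnaCloud Snapshot protocol itself, whose details are not given in the abstract.

```python
class SnapshotParticipant:
    """Marker-based global-snapshot participant (Chandy-Lamport style).
    Background illustration only -- not the UnaCloud Snapshot protocol."""

    def __init__(self, pid, incoming_channels):
        self.pid = pid
        self.local_state = None
        self.recorded = {c: [] for c in incoming_channels}      # in-flight messages
        self.marker_seen = {c: False for c in incoming_channels}
        self.snapshot_started = False

    def start_snapshot(self, current_state, broadcast_marker):
        self.local_state = current_state
        self.snapshot_started = True
        broadcast_marker(self.pid)        # marker goes out on every outgoing channel

    def on_message(self, channel, msg, current_state, broadcast_marker):
        if msg == "MARKER":
            if not self.snapshot_started:
                self.start_snapshot(current_state, broadcast_marker)
            self.marker_seen[channel] = True
        elif self.snapshot_started and not self.marker_seen[channel]:
            self.recorded[channel].append(msg)   # message was in flight at snapshot time

    def complete(self):
        return self.snapshot_started and all(self.marker_seen.values())
```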
Towards a Lock-Free, Fixed Size and Persistent Hash Map Design
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.26
M. Areias, Ricardo Rocha
Hash tries are a trie-based data structure with nearly ideal characteristics for the implementation of hash maps. In this paper, we present a novel, simple and scalable hash trie map design that fully supports concurrent search, insert and remove operations on hash maps. To the best of our knowledge, our proposal is the first concurrent hash map design that combines the following characteristics: (i) it is lock-free; (ii) it uses fixed-size data structures; and (iii) it maintains access to all internal data structures as persistent memory references. Experimental results show that our proposal is quite competitive when compared against other state-of-the-art proposals implemented in Java. Its design is modular enough to allow different configurations aimed at different trade-offs between memory usage and execution time.
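A single-threaded sketch of a hash trie with fixed-size nodes, to illustrate the underlying layout (each level consumes a fixed chunk of the key's hash). The paper's lock-free, CAS-based concurrency and persistent memory references are not reproduced here; node size, helper names and the no-full-hash-collision assumption are illustrative.

```python
class HashNode:
    """Fixed-size hash trie node: each bucket holds None, a (key, value) entry,
    or a child HashNode covering the next chunk of the hash."""
    SIZE = 8    # buckets per level (illustrative)
    BITS = 3    # log2(SIZE)

    def __init__(self):
        self.buckets = [None] * self.SIZE

def _index(key, level):
    return (hash(key) >> (level * HashNode.BITS)) & (HashNode.SIZE - 1)

def trie_insert(node, key, value, level=0):
    # Assumes no two distinct keys share the exact same full hash value.
    i = _index(key, level)
    slot = node.buckets[i]
    if slot is None:
        node.buckets[i] = (key, value)
    elif isinstance(slot, HashNode):
        trie_insert(slot, key, value, level + 1)
    elif slot[0] == key:
        node.buckets[i] = (key, value)            # update existing key
    else:                                         # collision: expand into a child node
        child = HashNode()
        node.buckets[i] = child
        trie_insert(child, slot[0], slot[1], level + 1)
        trie_insert(child, key, value, level + 1)

def trie_lookup(node, key, level=0):
    slot = node.buckets[_index(key, level)]
    if isinstance(slot, HashNode):
        return trie_lookup(slot, key, level + 1)
    if slot is not None and slot[0] == key:
        return slot[1]
    return None

root = HashNode()
for k in ("alpha", "beta", "gamma"):
    trie_insert(root, k, len(k))
print(trie_lookup(root, "beta"))   # -> 4
```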
A Machine Learning Approach for Performance Prediction and Scheduling on Heterogeneous CPUs
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.23
Daniel Nemirovsky, Tugberk Arkose, Nikola Marković, M. Nemirovsky, O. Unsal, A. Cristal
As heterogeneous systems become more ubiquitous, computer architects will need to develop novel CPU scheduling techniques capable of exploiting the diversity of computational resources. Accurately estimating the performance of applications on different heterogeneous resources can provide a significant advantage to heterogeneous schedulers seeking to improve system performance. Recent advances in machine learning, including artificial neural network models, have led to the development of powerful and practical prediction models for a variety of fields. As of yet, however, no significant leaps have been taken towards employing machine learning for heterogeneous scheduling in order to maximize system throughput. In this paper we propose a throughput-maximizing heterogeneous CPU scheduling model that uses machine learning to predict the performance of multiple threads on diverse system resources at the scheduling-quantum granularity. We demonstrate how lightweight artificial neural networks (ANNs) can provide highly accurate performance predictions for a diverse set of applications, thereby helping to improve heterogeneous scheduling efficiency. We show that online training is capable of increasing prediction accuracy, but that deepening the ANNs can result in diminishing returns. Notably, our approach yields 25% to 31% throughput improvements over conventional heterogeneous schedulers for CPU- and memory-intensive applications.
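A toy sketch of the overall idea, assuming a small feed-forward network maps per-thread performance-counter features to a predicted IPC per core type, and a greedy pass assigns threads to core slots for one quantum. The feature set, network shape, core-type names and scheduling policy are illustrative assumptions, not the paper's model.

```python
import numpy as np

def predict_ipc(features, W1, b1, W2, b2):
    """Tiny feed-forward ANN: per-thread counter features -> predicted IPC.
    The weights stand in for a trained (or online-trained) model."""
    hidden = np.maximum(0.0, features @ W1 + b1)   # ReLU hidden layer
    return float(hidden @ W2 + b2)

def schedule_quantum(thread_features, core_slots, models):
    """Greedy throughput-maximizing assignment for one scheduling quantum:
    each thread is placed on the still-free core type with the highest
    predicted IPC. Assumes at least as many core slots as threads.
    core_slots is e.g. ["big", "big", "little", "little"]; models maps a
    core type to its ANN weights (W1, b1, W2, b2)."""
    free = list(core_slots)
    plan = {}
    for tid, feats in thread_features.items():
        best = max(free, key=lambda core: predict_ipc(feats, *models[core]))
        plan[tid] = best
        free.remove(best)
    return plan
```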
The Case for Flexible ISAs: Unleashing Hardware and Software
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.16
R. Auler, E. Borin
For a long time, the Instruction Set Architecture (ISA) has been the firm contract between software and hardware. This firm contract plays an important role by decoupling the development of software from hardware micro-architectural features, enabling both to evolve independently. Nonetheless, it also condemns the ISA to become larger, more cluttered and less efficient as new instructions are incorporated over the years and deprecated instructions are left untouched to preserve legacy compatibility. In this work we propose OpenISA, a flexible ISA that enables both the software and the hardware to evolve independently, and discuss how OpenISA 1.0 was designed to enable efficient emulation of OpenISA software on alien ISAs, which is key to freeing the user from hardware lock-in. Our results show that software compiled to OpenISA can later be emulated on x86 and ARM processors with very little overhead, achieving near-native performance, with overhead under 10% for the majority of programs.
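As a generic illustration of software emulation of a guest ISA on an alien host (not OpenISA's actual encoding, calling convention or translation machinery), the toy interpreter below executes a small register-machine program; the opcode names are made up for the example.

```python
def emulate(program, regs=None):
    """Toy register-machine interpreter. Real OpenISA emulation relies on far
    more efficient techniques to reach near-native performance."""
    regs = dict(regs or {})
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "li":                      # load immediate: li rd, imm
            regs[args[0]] = args[1]
        elif op == "add":                   # add rd, rs1, rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "beq":                   # branch to target if rs1 == rs2
            if regs[args[0]] == regs[args[1]]:
                pc = args[2] - 1
        elif op == "halt":
            break
        pc += 1
    return regs

print(emulate([("li", "r1", 2), ("li", "r2", 3), ("add", "r0", "r1", "r2"), ("halt",)]))
```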
Towards a Deterministic Fine-Grained Task Ordering Using Multi-Versioned Memory
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.21
Eran Gilad, Tehila Mayzels, Elazar Raab, M. Oskin, Yoav Etsion
Task-based programming models aim to simplify parallel programming. A runtime system schedules tasks to execute on cores, and an essential component of this runtime is tracking and managing dependencies between tasks. A typical approach is to rely on programmers to annotate tasks and data structures, essentially manually specifying the input and output of each task. As such, dependencies are associated with named program objects, making this approach problematic for pointer-based data structures. Furthermore, because the runtime system must track these dependencies, the read and write sets should be kept small for efficient runtime performance. We presume a memory system with architecturally visible support for multiple versions of data stored at the same program address. This paper proposes and evaluates a task-based execution model that uses this versioned memory system to deterministically parallelize sequential code. We have built a task-based runtime layer that uses this type of memory system for dependence tracking. We demonstrate the advantages of the proposed model by parallelizing pointer-heavy code, obtaining speedups of up to 19x on a 32-core system.
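A toy sketch of a multi-versioned memory location, assuming tasks are numbered in program order (starting at 1) and a read returns the value written by the latest earlier task. It illustrates only the version-lookup rule; the runtime's dependence tracking, ordering enforcement and the architectural support the paper presumes are not modeled.

```python
import bisect

class VersionedCell:
    """Toy multi-versioned memory location: each write creates a new version
    tagged with the writer's sequential task index; a read sees the value of
    the latest preceding writer. Illustrative only."""

    def __init__(self, initial):
        self.idxs = [0]        # writer task indices, kept sorted (0 = initial value)
        self.vals = [initial]

    def write(self, task_idx, value):
        pos = bisect.bisect_left(self.idxs, task_idx)
        self.idxs.insert(pos, task_idx)
        self.vals.insert(pos, value)

    def read(self, task_idx):
        # Return the version written by the latest task strictly before task_idx.
        pos = bisect.bisect_left(self.idxs, task_idx)
        return self.vals[pos - 1]

x = VersionedCell(0)
x.write(1, 10)                 # task 1 writes 10
x.write(3, 30)                 # task 3 writes 30
print(x.read(2), x.read(4))    # task 2 sees 10, task 4 sees 30
```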
Data Coherence Analysis and Optimization for Heterogeneous Computing
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.9
R. Sousa, M. Pereira, Fernando Magno Quintão Pereira, G. Araújo
Although heterogeneous computing has enabled impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap the full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g., CUDA, OpenCL). The cost of moving and keeping host/device data coherent may easily eliminate any performance gains achieved by acceleration. Although this problem has been extensively studied for multicore architectures and was recently tackled in discrete GPUs through CUDA 8, no generic solution exists for integrated CPU/GPU architectures like those found in mobile devices (e.g., ARM Mali). This paper proposes Data Coherence Analysis (DCA), a set of two data-flow analyses that determine how variables are used by the host and device at each program point. It also introduces Data Coherence Optimization (DCO), a code optimization technique that uses DCA information to: (a) allocate OpenCL shared buffers between the host and devices; and (b) insert appropriate OpenCL function calls at program points so as to minimize the number of data coherence operations. DCO was implemented in AClang LLVM (www.aclang.org), a compiler capable of translating OpenMP 4.X annotated loops to OpenCL kernels, thus hiding the complexity of programming directly in OpenCL. Experimental results using DCA and DCO in AClang to compile programs from the Parboil, Polybench and Rodinia benchmarks reveal speed-ups of up to 5.25x on an Exynos 8890 octa-core CPU with an ARM Mali-T880 MP12 GPU and up to 2.03x on a 2.4 GHz dual-core Intel Core i5 processor equipped with an Intel Iris GPU.
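A toy forward pass over a single buffer's host/device accesses, emitting a coherence operation only when ownership actually changes hands, to convey why tracking usage per program point lets redundant coherence operations be elided. The state and operation names are made up for the example (they are not the actual OpenCL calls DCO inserts), and the real DCA is a compile-time data-flow analysis rather than a runtime pass.

```python
# Coherence states for a shared buffer, in the spirit of DCA/DCO.
HOST, DEVICE, COHERENT = "host", "device", "coherent"

def coherence_ops(accesses, state=COHERENT):
    """Given a sequence of ('host'|'device', 'read'|'write') accesses to one
    buffer, emit only the coherence operations actually required."""
    ops = []
    for side, kind in accesses:
        if side == "host" and state == DEVICE:
            ops.append("sync_to_host")       # make device writes visible to the host
            state = COHERENT
        elif side == "device" and state == HOST:
            ops.append("sync_to_device")     # hand the buffer back to the device
            state = COHERENT
        if kind == "write":
            state = HOST if side == "host" else DEVICE
    return ops

print(coherence_ops([("host", "write"), ("device", "read"),
                     ("device", "write"), ("host", "read")]))
# -> ['sync_to_device', 'sync_to_host']
```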
Accelerating Graph Analytics on CPU-FPGA Heterogeneous Platform
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.25
Shijie Zhou, V. Prasanna
Hardware accelerators for graph analytics have gained increasing interest. Vertex-centric and edge-centric paradigms are widely used to design graph analytics accelerators. However, both have notable drawbacks: the vertex-centric paradigm requires random memory accesses to traverse edges, while the edge-centric paradigm results in redundant edge traversals. In this paper, we explore the tradeoffs between the vertex-centric and edge-centric paradigms and propose a hybrid algorithm which dynamically selects between them during execution. We introduce the notion of the active vertex ratio, based on which we develop a simple but efficient paradigm selection approach. We develop a hybrid data structure that concurrently supports the vertex-centric and edge-centric paradigms. Based on this hybrid data structure, we propose a graph partitioning scheme to increase parallelism and enable efficient parallel computation on heterogeneous platforms. In each iteration, we use our paradigm selection approach to select the appropriate paradigm for each partition. Further, we map our hybrid algorithm onto a state-of-the-art heterogeneous platform which integrates a multi-core CPU and a Field-Programmable Gate Array (FPGA) in a cache-coherent fashion. We use our design methodology to accelerate two fundamental graph algorithms, breadth-first search (BFS) and single-source shortest path (SSSP). Experimental results show that our CPU-FPGA co-processing achieves up to 1.5× (1.9×) speedup for BFS (SSSP) compared with optimized baseline designs. Compared with state-of-the-art FPGA-based designs, our design achieves up to 4.0× (4.2×) throughput improvement for BFS (SSSP). Compared with a state-of-the-art multi-core design, our design demonstrates up to 1.5× (1.8×) speedup for BFS (SSSP).
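A sketch of the selection idea for BFS, assuming the active vertex ratio is compared against a fixed threshold each iteration to choose between a frontier-driven (vertex-centric) pass and a full edge-stream (edge-centric) pass. The threshold value, the in-memory data layout and the single-partition setting are illustrative; the CPU-FPGA mapping and the paper's hybrid data structure are not modeled.

```python
def hybrid_bfs(num_vertices, adj, edges, source, threshold=0.05):
    """BFS that switches per iteration between a vertex-centric pass (traverse
    only the frontier's outgoing edges) and an edge-centric pass (stream the
    whole edge list) based on the active vertex ratio."""
    dist = [None] * num_vertices
    dist[source] = 0
    frontier, level = {source}, 0
    while frontier:
        nxt = set()
        if len(frontier) / num_vertices < threshold:
            # Vertex-centric: random accesses, but only over the frontier's edges.
            for u in frontier:
                for v in adj[u]:
                    if dist[v] is None:
                        dist[v] = level + 1
                        nxt.add(v)
        else:
            # Edge-centric: sequential streaming of the entire edge list.
            for u, v in edges:
                if u in frontier and dist[v] is None:
                    dist[v] = level + 1
                    nxt.add(v)
        frontier = nxt
        level += 1
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(hybrid_bfs(4, adj, edges, source=0, threshold=0.5))   # -> [0, 1, 1, 2]
```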
Overcoming Memory-Capacity Constraints in the Use of ILUPACK on Graphics Processors
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.13
J. Aliaga, Ernesto Dufrechu, P. Ezzatti, E. S. Quintana‐Ortí
A significant number of scientific and engineering problems currently require the solution of large, sparse linear systems of equations. In previous work, we applied a GPU accelerator to the solution of sparse linear systems of moderate dimension via ILUPACK, showing important reductions in execution time while maintaining the quality of the solution. Unfortunately, using GPUs attached to a single compute node strongly limits the memory available to solve the systems, and thus the size of the problems that can be tackled with this approach. In this work we introduce a distributed-parallel version of ILUPACK that overcomes these limitations. The evaluation results show that using multiple GPUs, located on distinct nodes of a cluster, yields relevant reductions in execution time for large problems and, more importantly, allows the dimension of the problems to be increased, showing interesting scaling properties.
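A minimal sketch of why distributing the system relaxes the memory-capacity limit: an even row-block partition spreads the data over several devices, so each GPU holds only a fraction of the problem. ILUPACK's actual multilevel, distributed-parallel organization is far richer than this; the function below is purely illustrative.

```python
def partition_rows(n_rows, n_gpus):
    """Even row-block partition of a sparse system across GPUs/nodes,
    returning (start, end) row ranges; each device stores ~n_rows/n_gpus rows."""
    base, extra = divmod(n_rows, n_gpus)
    bounds, start = [], 0
    for g in range(n_gpus):
        size = base + (1 if g < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

print(partition_rows(10, 3))   # -> [(0, 4), (4, 7), (7, 10)]
```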