Deep Learning-Based Nuclei Segmentation of Cleared Brain Tissue
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916435
Pooya Khorrami, K. Brady, Mark Hernandez, L. Gjesteby, S. Burke, Damon G. Lamb, Matthew A. Melton, K. Otto, L. Brattain
We present a deep learning approach for nuclei segmentation at scale. Our algorithm addresses the challenge of segmentation in dense scenes when only limited annotated data are available. Annotation in this domain is highly manual, requiring time-consuming markup of neurons and extensive expertise, and it often results in errors. For these reasons, our approach employs methods adopted from transfer learning. The approach can also be extended to segment other components of the neurons.
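For readers unfamiliar with the transfer-learning pattern the abstract refers to, the sketch below shows the general recipe in PyTorch: reuse a feature extractor trained on a related, data-rich task and fine-tune only a small segmentation head on the scarce annotated data. Everything here (network shape, checkpoint path, hyperparameters) is an illustrative assumption, not the paper's actual model.

```python
import torch
import torch.nn as nn

class SegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(   # stands in for a pretrained feature extractor
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(32, 1, 1)  # per-pixel nucleus/background logit

    def forward(self, x):
        return self.head(self.encoder(x))

model = SegNet()
# Transfer learning: load encoder weights trained elsewhere (hypothetical
# checkpoint path), freeze them, and fine-tune only the segmentation head.
# model.encoder.load_state_dict(torch.load("pretrained_encoder.pt"))
for p in model.encoder.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(2, 1, 64, 64)                  # toy image batch
y = (torch.rand(2, 1, 64, 64) > 0.5).float()   # toy annotation masks
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```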
{"title":"Deep Learning-Based Nuclei Segmentation of Cleared Brain Tissue","authors":"Pooya Khorrami, K. Brady, Mark Hernandez, L. Gjesteby, S. Burke, Damon G. Lamb, Matthew A. Melton, K. Otto, L. Brattain","doi":"10.1109/HPEC.2019.8916435","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916435","url":null,"abstract":"We present a deep learning approach for nuclei segmentation at scale. Our algorithm aims to address the challenge of segmentation in dense scenes with limited annotated data available. Annotation in this domain is highly manual in nature, requiring time-consuming markup of the neuron and extensive expertise, and often results in errors. For these reasons, the approach under consideration employs methods adopted from transfer learning. This approach can also be extended to segment other components of the neurons.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128890811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous Environments (Update on Static Graph Challenge)
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916233
Abdurrahman Yasar, S. Rajamanickam, Jonathan W. Berry, Michael M. Wolf, Jeffrey S. Young, Ümit V. Çatalyürek
Triangle counting is a representative graph problem that illustrates the challenges of improving graph algorithm performance through algorithmic techniques and of adapting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adapt this task-based algorithm to heterogeneous architectures (CPUs and GPUs) for up to a 10.8x speedup over last year's Graph Challenge submission. This implementation also achieves the fastest kernel times known at the time of publication for real-world graphs such as twitter (3.7 seconds) and friendster (1.8 seconds) on GPU accelerators when the graph is GPU-resident, 1.7x and 1.2x improvements, respectively, over the previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation with communication of the graph to the GPUs; thanks to very low overhead costs, our implementation achieves the fastest end-to-end times as well.
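The linear-algebraic formulation the abstract builds on can be stated compactly: with L the strictly lower-triangular part of the adjacency matrix A, the triangle count is sum((L @ L) .* L), a masked sparse matrix product that counts each triangle exactly once. A minimal scipy sketch of that formula follows; it stands in for the paper's tiled, task-parallel kernel and is illustrative only.

```python
import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]))
L = sp.tril(A, k=-1).tocsr()              # strictly lower-triangular part

# (L @ L)[i, j] counts wedges i > k > j; masking by L keeps only closed ones,
# so each triangle (a < b < c) contributes exactly once, at entry (c, a).
count = (L @ L).multiply(L).sum()
print(int(count))                          # 2 triangles: (0,1,2) and (1,2,3)
```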
{"title":"Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous Environments : (Update on Static Graph Challenge)","authors":"Abdurrahman Yasar, S. Rajamanickam, Jonathan W. Berry, Michael M. Wolf, Jeffrey S. Young, Ümit V. Çatalyürek","doi":"10.1109/HPEC.2019.8916233","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916233","url":null,"abstract":"Triangle counting is a representative graph problem that shows the challenges of improving graph algorithm performance using algorithmic techniques and adopting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adopt this task based algorithm to heterogeneous architectures (CPUs and GPUs) for up to 10.8x speed up over past year’s graph challenge submission. This implementation also results in the fastest kernel time known at time of publication for real-world graphs like twitter (3.7 second) and friendster (1.8 seconds) on GPU accelerators when the graph is GPU resident. This is a 1.7 and 1.2 time improvement over previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation and communication of the graph to the GPUs. In terms of end-to-end execution time, our implementation also achieves the fastest end-to-end times due to very low overhead costs.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116448181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survey of Attacks and Defenses on Edge-Deployed Neural Networks
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916519
Mihailo Isakov, V. Gadepally, K. Gettings, M. Kinsy
Deep Neural Network (DNN) workloads are quickly moving from datacenters onto edge devices, for latency, privacy, or energy reasons. While datacenter networks can be protected using conventional cybersecurity measures, edge neural networks bring a host of new security challenges. Unlike classic IoT applications, edge neural networks are typically compute- and memory-intensive, their execution is data-independent, and they are robust to noise and faults. Neural network models can be very expensive to develop and can potentially reveal information about the private data they were trained on, requiring special care in distribution. The hidden states and outputs of the network can also be used to reconstruct user inputs, potentially violating users' privacy. Furthermore, neural networks are vulnerable to adversarial attacks, which may cause misclassifications and violate the integrity of the output. These properties complicate the task of securing edge-deployed DNNs, requiring new considerations, threat models, priorities, and approaches for deploying DNNs to the edge securely and privately. In this work, we cover the landscape of attacks on, and defenses of, neural networks deployed on edge devices and provide a taxonomy of attacks and defenses targeting edge DNNs.
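As one concrete instance of the adversarial attacks the survey covers, the sketch below implements the classic one-step FGSM perturbation in PyTorch: the input is nudged in the direction of the loss gradient's sign. The toy model and epsilon are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy classifier
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # input image in [0, 1]
y = torch.tensor([3])                             # true label

loss = loss_fn(model(x), y)
loss.backward()                                   # gradient w.r.t. the input

eps = 0.03                                        # perturbation budget
x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
# x_adv is visually near-identical to x but may flip the model's prediction
```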
{"title":"Survey of Attacks and Defenses on Edge-Deployed Neural Networks","authors":"Mihailo Isakov, V. Gadepally, K. Gettings, M. Kinsy","doi":"10.1109/HPEC.2019.8916519","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916519","url":null,"abstract":"Deep Neural Network (DNN) workloads are quickly moving from datacenters onto edge devices, for latency, privacy, or energy reasons. While datacenter networks can be protected using conventional cybersecurity measures, edge neural networks bring a host of new security challenges. Unlike classic IoT applications, edge neural networks are typically very compute and memory intensive, their execution is data-independent, and they are robust to noise and faults. Neural network models may be very expensive to develop, and can potentially reveal information about the private data they were trained on, requiring special care in distribution. The hidden states and outputs of the network can also be used in reconstructing user inputs, potentially violating users’ privacy. Furthermore, neural networks are vulnerable to adversarial attacks, which may cause misclassifications and violate the integrity of the output. These properties add challenges when securing edge-deployed DNNs, requiring new considerations, threat models, priorities, and approaches in securely and privately deploying DNNs to the edge. In this work, we cover the landscape of attacks on, and defenses, of neural networks deployed in edge devices and provide a taxonomy of attacks and defenses targeting edge DNNs.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121849629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPU
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916473
Mark P. Blanco, Tze Meng Low, Kyungjoo Kim
In this work we present a performance exploration of Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained parallel approach to executing the support computation. This approach also increases the available parallelism, making it amenable to GPU execution. We demonstrate our fine-grained parallel approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe a 1.26-1.48x improvement on the CPU and a 9.97-16.92x improvement on the GPU due to our fine-grained parallel formulation.
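The support computation the paper parallelizes assigns to each edge the number of triangles it participates in; in linear-algebraic form this is the SpGEMM A*A masked by the adjacency matrix A. A small sequential scipy illustration follows (the paper's Kokkos kernels compute the same quantity with fine-grained parallel tasks).

```python
import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]))

# entry (u, v) of the masked product = common neighbors of u and v
# = number of triangles containing edge (u, v), i.e. the edge's support
support = (A @ A).multiply(A)
print(support.toarray())   # edge (0, 2) lies in 2 triangles, the rest in 1

# a k-truss keeps only edges with support >= k - 2, peeling iteratively
```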
FPGA-Accelerated Spreading for Global Placement
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916251
Shounak Dhar, L. Singhal, M. Iyer, D. Pan
Placement accounts for a large part of the runtime of an Electronic Design Automation design implementation flow. In modern industrial and academic physical design implementation tools, global placement consumes a significant part of the overall placement runtime. Many of these global placers decouple the placement problem into two main parts: numerical optimization and spreading. In this paper, we propose a new, massively parallel spreading algorithm and accelerate a part of it on FPGA. Our algorithm produces placements of comparable quality when integrated into a state-of-the-art academic placer. We formulate the spreading problem as a system of fluid flows across reservoirs and mathematically prove that this formulation produces flows without cycles when solved as a continuous-time system. We also propose a flow correction algorithm that makes the flows monotonic, reduces total cell displacement, and removes cycles that may arise during the discretization process. Our new flow correction algorithm has a better time complexity for cycle removal than previous algorithms for finding cycles in a generic graph. Compared to our previously published linear-programming-based spreading algorithm [1], our new fluid-flow-based multi-threaded spreading algorithm is 3.44x faster, and the corresponding FPGA-accelerated version is 5.15x faster.
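To give a feel for spreading as flow between reservoirs, the toy below diffuses overflow from over-full placement bins toward emptier neighbors in one dimension. It is a loose illustration only: the paper's continuous-time fluid formulation, acyclicity proof, and flow correction pass are not reproduced here.

```python
def spread(util, capacity=1.0, iters=200):
    """One bin per reservoir; overflow diffuses toward emptier neighbors."""
    util = list(util)
    for _ in range(iters):
        if max(util) <= capacity + 1e-9:      # every bin fits: done
            break
        for i in range(len(util)):
            if util[i] <= capacity:
                continue                       # only over-full bins emit flow
            for j in (i - 1, i + 1):           # neighboring reservoirs
                if 0 <= j < len(util) and util[j] < util[i]:
                    # move at most half the overflow, never past equalization
                    flow = min((util[i] - capacity) / 2,
                               (util[i] - util[j]) / 2)
                    util[i] -= flow
                    util[j] += flow
    return [round(u, 3) for u in util]

print(spread([2.5, 0.2, 0.1, 0.0]))  # overflow drains toward the right
```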
{"title":"FPGA-Accelerated Spreading for Global Placement","authors":"Shounak Dhar, L. Singhal, M. Iyer, D. Pan","doi":"10.1109/HPEC.2019.8916251","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916251","url":null,"abstract":"Placement takes a large part of the runtime in an Electronic Design Automation design implementation flow. In modern industrial and academic physical design impementation tools, global placement consumes a significant part of the overall placement runtime. Many of these global placers decouple the placement problem into two main parts - numerical optimization and spreading. In this paper, we propose a new and massively parallel spreading algorithm and also accelerate a part of this algorithm on FPGA. Our algorithm produces placements with comparable quality when integrated into a state-of-the-art academic placer. We formulate the spreading problem as a system of fluid flows across reservoirs and mathematically prove that this formulation produces flows without cycles when solved as a continuous-time system. We also propose a flow correction algorithm to make the flows monotonic, reduce total cell displacement and remove cycles which may arise during the discretization process. Our new flow correction algorithm has a better time complexity for cycle removal than previous algorithms for finding cycles in a generic graph. When compared to our previously published linear programming based spreading algorithm [1], our new fluid-flow based multi-threaded spreading algorithm is 3.44x faster, and the corresponding FPGA-accelerated version is 5.15x faster.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121662223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic Parallelization to Asynchronous Task-Based Runtimes Through a Generic Runtime Layer
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916294
Charles Jin, M. Baskaran, Benoît Meister, J. Springer
With the end of Moore's law, asynchronous task-based parallelism has seen growing support as a parallel programming paradigm, with the runtime system offering such advantages as dynamic load balancing, locality, and scalability. However, there has been a proliferation of such programming systems in recent years, each of which presents different performance tradeoffs and runtime semantics. Developing applications on top of these systems thus requires not only application expertise but also deep familiarity with the runtime, exacerbating the perennial problems of programmability and portability. This work makes three main contributions to this growing landscape. First, we extend a polyhedral optimizing compiler with techniques to extract task-based parallelism and data management for a broad class of asynchronous task-based runtimes. Second, we introduce a generic runtime layer for asynchronous task-based systems with representations of data and tasks that are sparse and tiled by default, which serves as an abstract target for the compiler backend. Finally, we implement this generic layer using OpenMP and Legion, demonstrating the flexibility and viability of the generic layer and delivering an end-to-end path for automatic parallelization to asynchronous task-based runtimes. Using a wide range of applications from deep learning to scientific kernels, we obtain geometric mean speedups of 23.0x (OpenMP) and 9.5x (Legion) using 64 threads.
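To make the "generic runtime layer" idea concrete, here is a toy dependency-driven task layer in Python, with a thread pool standing in for a backend such as OpenMP or Legion. The class and method names are invented for illustration and do not correspond to the paper's API; the point is only that a compiler can emit tasks and dependence edges against one abstract interface and let the backend schedule them asynchronously.

```python
from concurrent.futures import ThreadPoolExecutor

class TaskGraph:
    """Run each task once all of its declared dependencies have completed."""
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(workers)
        self.futures = {}

    def submit(self, name, fn, deps=()):
        def run():
            for d in deps:                    # block until producers finish
                self.futures[d].result()
            return fn()
        self.futures[name] = self.pool.submit(run)
        return self.futures[name]

g = TaskGraph()
g.submit("load", lambda: print("load tile"))
g.submit("compute", lambda: print("compute on tile"), deps=("load",))
g.submit("store", lambda: print("store tile"), deps=("compute",))
g.futures["store"].result()                   # wait for the chain to drain
```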
{"title":"Automatic Parallelization to Asynchronous Task-Based Runtimes Through a Generic Runtime Layer","authors":"Charles Jin, M. Baskaran, Benoît Meister, J. Springer","doi":"10.1109/HPEC.2019.8916294","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916294","url":null,"abstract":"With the end of Moore’s law, asynchronous task-based parallelism has seen growing support as a parallel programming paradigm, with the runtime system offering such advantages as dynamic load balancing, locality, and scalability. However, there has been a proliferation of such programming systems in recent years, each of which presents different performance tradeoffs and runtime semantics. Developing applications on top of these systems thus requires not only application expertise but also deep familiarity with the runtime, exacerbating the perennial problems of programmability and portability.This work makes three main contributions to this growing landscape. First, we extend a polyhedral optimizing compiler with techniques to extract task-based parallelism and data management for a broad class of asynchronous task-based runtimes. Second, we introduce a generic runtime layer for asynchronous task-based systems with representations of data and tasks that are sparse and tiled by default, which serves as an abstract target for the compiler backend. Finally, we implement this generic layer using OpenMP and Legion, demonstrating the flexibility and viability of the generic layer and delivering an end-to-end path for automatic parallelization to asynchronous task-based runtimes. Using a wide range of applications from deep learning to scientific kernels, we obtain geometric mean speedups of 23.0* (OpenMP) and 9.5* (Legion) using 64 threads.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121664685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Garbled Circuits in the Cloud using FPGA Enabled Nodes
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916407
Kai Huang, Mehmet Güngör, Xin Fang, Stratis Ioannidis, M. Leeser
Data privacy is an increasing concern in our interconnected world. Garbled circuits are an important approach to Secure Function Evaluation (SFE); however, they suffer from long garbling times. In this paper we present garbled circuits in the cloud using Amazon Web Services, in particular Amazon F1 FPGA-enabled nodes. We implement both the garbler and the evaluator in software and show how F1 instances can accelerate the garbling process and rapidly adapt to several different applications. Experimental results, measured on AWS, indicate a 15x speedup for garbling done using an FPGA. This yields a total application speedup, including garbling, communication, and evaluation, of close to 3x over a large range of application sizes.
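As background on what "garbling" means, the toy below garbles a single AND gate: the garbler encrypts each output-wire label under the pair of input-wire labels that selects it, and the evaluator, holding exactly one label per input wire, can decrypt exactly one row. This sketch is illustrative only; real systems, including the paper's, use optimized constructions such as point-and-permute, free-XOR, and half-gates.

```python
import hashlib, os, random

def H(*keys):
    """Hash a pair of wire labels into a one-time pad for a table row."""
    return hashlib.sha256(b"".join(keys)).digest()[:16]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# one random 128-bit label per wire (inputs a, b; output c) and per bit value
labels = {w: {v: os.urandom(16) for v in (0, 1)} for w in ("a", "b", "c")}

# garbler: encrypt each output label under the matching pair of input labels
table = [xor(H(labels["a"][va], labels["b"][vb]), labels["c"][va & vb])
         for va in (0, 1) for vb in (0, 1)]
random.shuffle(table)  # hide which row encodes which input combination

# evaluator: holds one label per input wire (here a=1, b=1) and recovers
# only the corresponding output label, learning nothing about other rows
ka, kb = labels["a"][1], labels["b"][1]
for row in table:
    candidate = xor(row, H(ka, kb))
    if candidate in labels["c"].values():   # real schemes use point-and-permute
        print("recovered AND(1, 1) =", candidate == labels["c"][1])
```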
Accelerating DNN Inference with GraphBLAS and the GPU
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916498
Xiaoyun Wang, Zhongyi Lin, Carl Yang, John Douglas Owens
This work addresses the 2019 Sparse Deep Neural Network Graph Challenge with an implementation using the GraphBLAS programming model. We demonstrate our solution with GraphBLAST, a GraphBLAS implementation on the GPU, and compare it to SuiteSparse, a GraphBLAS implementation on the CPU. The GraphBLAST implementation is 1.94x faster than SuiteSparse; the primary opportunity for further GPU performance is a higher-performance sparse-matrix-times-sparse-matrix (SpGEMM) kernel.
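The challenge's inference loop is a chain of sparse-matrix products: each layer computes Y = ReLU(Y*W + b) with both the feature matrix Y and the weight matrix W sparse, which is why SpGEMM dominates the runtime. A scipy sketch of that loop follows, standing in for the GraphBLAS kernels the paper benchmarks; the sizes, density, and bias value are made-up illustrative choices.

```python
import numpy as np
import scipy.sparse as sp

def random_sparse(m, n, seed, density=0.1):
    return sp.random(m, n, density=density, format="csr", random_state=seed)

Y = random_sparse(8, 16, seed=0)                          # sparse feature batch
layers = [random_sparse(16, 16, seed=i + 1) for i in range(3)]  # sparse weights
bias = -0.1                                               # constant per-layer bias

for W in layers:
    Z = Y @ W                          # SpGEMM: the dominant kernel
    Z.data += bias                     # bias applied to the stored entries
    Z.data = np.maximum(Z.data, 0.0)   # ReLU
    Z.eliminate_zeros()                # keep the representation sparse
    Y = Z

print("nonzeros after 3 layers:", Y.nnz)
```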
{"title":"Accelerating DNN Inference with GraphBLAS and the GPU","authors":"Xiaoyun Wang, Zhongyi Lin, Carl Yang, John Douglas Owens","doi":"10.1109/HPEC.2019.8916498","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916498","url":null,"abstract":"This work addresses the 2019 Sparse Deep Neural Network Graph Challenge with an implementation of this challenge using the GraphBLAS programming model. We demonstrate our solution to this challenge with GraphBLAST, a GraphBLAS implementation on the GPU, and compare it to SuiteSparse, a GraphBLAS implementation on the CPU. The GraphBLAST implementation is $1.94 times $ faster than Suite-Sparse; the primary opportunity to increase performance on the GPU is a higher-performance sparse-matrix-times-sparse-matrix (SpGEMM) kernel.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121441934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous Cache Hierarchy Management for Integrated CPU-GPU Architecture
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916239
Hao Wen, W. Zhang
Unlike traditional CPU-GPU heterogeneous architectures, in which the CPU and GPU have separate DRAM and memory address spaces, current heterogeneous CPU-GPU architectures integrate the CPU and GPU on the same die, sharing the same last-level cache (LLC) and memory. In a two-level cache hierarchy where the CPU and GPU have their own private L1 caches but share the LLC, conflict misses in the LLC between CPU and GPU requests may degrade both CPU and GPU performance. In addition, how the CPU and GPU memory request flows (the write-back flow from L1 and the cache-fill flow from main memory) are managed may impact performance. In this work, we study three cache request flow management policies. The first is selective GPU LLC fill, which selectively fills GPU requests in the LLC. The second is selective GPU L1 write back, which selectively writes GPU blocks in the L1 cache back to the L2 cache. The third is a hybrid policy that combines the first two and selectively replaces CPU blocks in the LLC. Our experimental results indicate that the hybrid policy is the best of the three: on average, it improves CPU performance by about 10%, with a maximum CPU improvement of 22% and an average GPU performance overhead of 0.8%.
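A minimal model of the first policy, selective GPU LLC fill: on a GPU miss, the line is installed in the shared LLC only when selected, so streaming GPU traffic cannot evict the CPU's working set. The single LRU set and the all-or-nothing selection knob below are simplifying assumptions for illustration, not the paper's actual selection heuristic.

```python
from collections import OrderedDict

class SharedLLC:
    """A single LRU set shared by CPU and GPU requests."""
    def __init__(self, capacity=4):
        self.cache = OrderedDict()           # address -> owner ("cpu" or "gpu")
        self.capacity = capacity

    def access(self, addr, owner, fill_gpu=False):
        if addr in self.cache:               # hit: refresh LRU position
            self.cache.move_to_end(addr)
            return True
        # miss: GPU requests fill the LLC only when selected by the policy
        if owner == "gpu" and not fill_gpu:
            return False                     # bypass: serve from memory, no fill
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # evict the LRU line
        self.cache[addr] = owner
        return False

llc = SharedLLC()
for a in range(4):
    llc.access(a, "cpu")       # the CPU working set fills the LLC
llc.access(100, "gpu")         # GPU miss bypasses the fill: CPU lines survive
print(list(llc.cache))         # [0, 1, 2, 3]
```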
{"title":"Heterogeneous Cache Hierarchy Management for Integrated CPU-GPU Architecture","authors":"Hao Wen, W. Zhang","doi":"10.1109/HPEC.2019.8916239","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916239","url":null,"abstract":"Unlike the traditional CPU-GPU heterogeneous architecture where CPU and GPU have separate DRAM and memory address space, current heterogeneous CPU-GPU architectures integrate CPU and GPU in the same die and share the same last level cache (LLC) and memory. For the two-level cache hierarchy in which CPU and GPU have their own private L1 caches but share the LLC, conflict misses in the LLC between CPU and GPU may degrade both CPU and GPU performance. In addition, how the CPU and GPU memory requests flows (write back flow from L1 and cache fill flow from main memory) are managed may impact the performance. In this work, we study three different cache requests flow management policies. The first policy is selective GPU LLC fill, which selectively fills the GPU requests in the LLC. The second policy is selective GPU L1 write back, which selectively writes back GPU blocks in L1 cache to L2 cache. The final policy is a hybrid policy that combines the first two, and selectively replaces CPU blocks in the LLC. Our experimental results indicate that the third policy is the best of these three. On average, it can improve the CPU performance by about 10%, with the highest CPU performance improvement of 22%, with 0.8% averaged GPU performance overhead.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121939566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DistTC: High Performance Distributed Triangle Counting
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916438
Loc Hoang, Vishwesh Jatala, Xuhao Chen, U. Agarwal, Roshan Dathathri, G. Gill, K. Pingali
We describe a novel multi-machine multi-GPU implementation of triangle counting that exploits an application-agnostic graph partitioning strategy eliminating almost all inter-host communication during triangle counting. Experimental results show that this distributed implementation can handle very large graphs such as clueweb12, which has almost one billion vertices and 37 billion edges, and that it is up to 1.6× faster than TriCore, the 2018 Graph Challenge champion.
{"title":"DistTC: High Performance Distributed Triangle Counting","authors":"Loc Hoang, Vishwesh Jatala, Xuhao Chen, U. Agarwal, Roshan Dathathri, G. Gill, K. Pingali","doi":"10.1109/HPEC.2019.8916438","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916438","url":null,"abstract":"We describe a novel multi-machine multi-GPU implementation of triangle counting which exploits a novel application-agnostic graph partitioning strategy that eliminates almost all inter-host communication during triangle counting. Experimental results show that this distributed triangle counting implementation can handle very large graphs such as clueweb12, which has almost one billion vertices and 37 billion edges, and it is up to 1.6× faster than TriCore, the 2018 Graph Challenge champion.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"63 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123460675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}