Deep Learning-Based Nuclei Segmentation of Cleared Brain Tissue
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916435
Pooya Khorrami, K. Brady, Mark Hernandez, L. Gjesteby, S. Burke, Damon G. Lamb, Matthew A. Melton, K. Otto, L. Brattain
We present a deep learning approach for nuclei segmentation at scale. Our algorithm addresses the challenge of segmentation in dense scenes when only limited annotated data are available. Annotation in this domain is highly manual, requiring time-consuming markup of neurons and extensive expertise, and it often results in errors. For these reasons, our approach employs methods adopted from transfer learning. The approach can also be extended to segment other components of the neurons.
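For readers unfamiliar with the transfer-learning pattern the abstract refers to, the sketch below shows the general recipe in PyTorch: reuse a feature extractor trained on a related, data-rich task and fine-tune only a small segmentation head on the scarce annotated data. Everything here (network shape, checkpoint path, hyperparameters) is an illustrative assumption, not the paper's actual model.

```python
import torch
import torch.nn as nn

class SegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(   # stands in for a pretrained feature extractor
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(32, 1, 1)  # per-pixel nucleus/background logit

    def forward(self, x):
        return self.head(self.encoder(x))

model = SegNet()
# Transfer learning: load encoder weights trained elsewhere (hypothetical
# checkpoint path), freeze them, and fine-tune only the segmentation head.
# model.encoder.load_state_dict(torch.load("pretrained_encoder.pt"))
for p in model.encoder.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(2, 1, 64, 64)                  # toy image batch
y = (torch.rand(2, 1, 64, 64) > 0.5).float()   # toy annotation masks
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```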
{"title":"Deep Learning-Based Nuclei Segmentation of Cleared Brain Tissue","authors":"Pooya Khorrami, K. Brady, Mark Hernandez, L. Gjesteby, S. Burke, Damon G. Lamb, Matthew A. Melton, K. Otto, L. Brattain","doi":"10.1109/HPEC.2019.8916435","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916435","url":null,"abstract":"We present a deep learning approach for nuclei segmentation at scale. Our algorithm aims to address the challenge of segmentation in dense scenes with limited annotated data available. Annotation in this domain is highly manual in nature, requiring time-consuming markup of the neuron and extensive expertise, and often results in errors. For these reasons, the approach under consideration employs methods adopted from transfer learning. This approach can also be extended to segment other components of the neurons.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128890811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous Environments (Update on Static Graph Challenge)
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916233
Abdurrahman Yasar, S. Rajamanickam, Jonathan W. Berry, Michael M. Wolf, Jeffrey S. Young, Ümit V. Çatalyürek
Triangle counting is a representative graph problem that illustrates the challenges of improving graph algorithm performance through algorithmic techniques and of adapting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adapt this task-based algorithm to heterogeneous architectures (CPUs and GPUs) for up to a 10.8x speedup over last year's Graph Challenge submission. This implementation also achieves the fastest kernel times known at the time of publication for real-world graphs such as twitter (3.7 seconds) and friendster (1.8 seconds) on GPU accelerators when the graph is GPU-resident, 1.7x and 1.2x improvements, respectively, over the previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation with communication of the graph to the GPUs; thanks to very low overhead costs, our implementation achieves the fastest end-to-end times as well.
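The linear-algebraic formulation the abstract builds on can be stated compactly: with L the strictly lower-triangular part of the adjacency matrix A, the triangle count is sum((L @ L) .* L), a masked sparse matrix product that counts each triangle exactly once. A minimal scipy sketch of that formula follows; it stands in for the paper's tiled, task-parallel kernel and is illustrative only.

```python
import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]))
L = sp.tril(A, k=-1).tocsr()              # strictly lower-triangular part

# (L @ L)[i, j] counts wedges i > k > j; masking by L keeps only closed ones,
# so each triangle (a < b < c) contributes exactly once, at entry (c, a).
count = (L @ L).multiply(L).sum()
print(int(count))                          # 2 triangles: (0,1,2) and (1,2,3)
```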
{"title":"Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous Environments : (Update on Static Graph Challenge)","authors":"Abdurrahman Yasar, S. Rajamanickam, Jonathan W. Berry, Michael M. Wolf, Jeffrey S. Young, Ümit V. Çatalyürek","doi":"10.1109/HPEC.2019.8916233","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916233","url":null,"abstract":"Triangle counting is a representative graph problem that shows the challenges of improving graph algorithm performance using algorithmic techniques and adopting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adopt this task based algorithm to heterogeneous architectures (CPUs and GPUs) for up to 10.8x speed up over past year’s graph challenge submission. This implementation also results in the fastest kernel time known at time of publication for real-world graphs like twitter (3.7 second) and friendster (1.8 seconds) on GPU accelerators when the graph is GPU resident. This is a 1.7 and 1.2 time improvement over previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation and communication of the graph to the GPUs. In terms of end-to-end execution time, our implementation also achieves the fastest end-to-end times due to very low overhead costs.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116448181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survey of Attacks and Defenses on Edge-Deployed Neural Networks
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916519
Mihailo Isakov, V. Gadepally, K. Gettings, M. Kinsy
Deep Neural Network (DNN) workloads are quickly moving from datacenters onto edge devices, for latency, privacy, or energy reasons. While datacenter networks can be protected using conventional cybersecurity measures, edge neural networks bring a host of new security challenges. Unlike classic IoT applications, edge neural networks are typically compute- and memory-intensive, their execution is data-independent, and they are robust to noise and faults. Neural network models can be very expensive to develop and can potentially reveal information about the private data they were trained on, requiring special care in distribution. The hidden states and outputs of the network can also be used to reconstruct user inputs, potentially violating users' privacy. Furthermore, neural networks are vulnerable to adversarial attacks, which may cause misclassifications and violate the integrity of the output. These properties complicate the task of securing edge-deployed DNNs, requiring new considerations, threat models, priorities, and approaches for deploying DNNs to the edge securely and privately. In this work, we cover the landscape of attacks on, and defenses of, neural networks deployed on edge devices and provide a taxonomy of attacks and defenses targeting edge DNNs.
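As one concrete instance of the adversarial attacks the survey covers, the sketch below implements the classic one-step FGSM perturbation in PyTorch: the input is nudged in the direction of the loss gradient's sign. The toy model and epsilon are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy classifier
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # input image in [0, 1]
y = torch.tensor([3])                             # true label

loss = loss_fn(model(x), y)
loss.backward()                                   # gradient w.r.t. the input

eps = 0.03                                        # perturbation budget
x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
# x_adv is visually near-identical to x but may flip the model's prediction
```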
{"title":"Survey of Attacks and Defenses on Edge-Deployed Neural Networks","authors":"Mihailo Isakov, V. Gadepally, K. Gettings, M. Kinsy","doi":"10.1109/HPEC.2019.8916519","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916519","url":null,"abstract":"Deep Neural Network (DNN) workloads are quickly moving from datacenters onto edge devices, for latency, privacy, or energy reasons. While datacenter networks can be protected using conventional cybersecurity measures, edge neural networks bring a host of new security challenges. Unlike classic IoT applications, edge neural networks are typically very compute and memory intensive, their execution is data-independent, and they are robust to noise and faults. Neural network models may be very expensive to develop, and can potentially reveal information about the private data they were trained on, requiring special care in distribution. The hidden states and outputs of the network can also be used in reconstructing user inputs, potentially violating users’ privacy. Furthermore, neural networks are vulnerable to adversarial attacks, which may cause misclassifications and violate the integrity of the output. These properties add challenges when securing edge-deployed DNNs, requiring new considerations, threat models, priorities, and approaches in securely and privately deploying DNNs to the edge. In this work, we cover the landscape of attacks on, and defenses, of neural networks deployed in edge devices and provide a taxonomy of attacks and defenses targeting edge DNNs.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121849629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPU
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916473
Mark P. Blanco, Tze Meng Low, Kyungjoo Kim
In this work we present a performance exploration of Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained parallel approach to executing the support computation. This approach also increases the available parallelism, making it amenable to GPU execution. We demonstrate our fine-grained parallel approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe a 1.26-1.48x improvement on the CPU and a 9.97-16.92x improvement on the GPU due to our fine-grained parallel formulation.
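The support computation the paper parallelizes assigns to each edge the number of triangles it participates in; in linear-algebraic form this is the SpGEMM A*A masked by the adjacency matrix A. A small sequential scipy illustration follows (the paper's Kokkos kernels compute the same quantity with fine-grained parallel tasks).

```python
import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]))

# entry (u, v) of the masked product = common neighbors of u and v
# = number of triangles containing edge (u, v), i.e. the edge's support
support = (A @ A).multiply(A)
print(support.toarray())   # edge (0, 2) lies in 2 triangles, the rest in 1

# a k-truss keeps only edges with support >= k - 2, peeling iteratively
```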
FPGA-Accelerated Spreading for Global Placement
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916251
Shounak Dhar, L. Singhal, M. Iyer, D. Pan
Placement accounts for a large part of the runtime of an Electronic Design Automation design implementation flow. In modern industrial and academic physical design implementation tools, global placement consumes a significant part of the overall placement runtime. Many of these global placers decouple the placement problem into two main parts: numerical optimization and spreading. In this paper, we propose a new, massively parallel spreading algorithm and accelerate a part of it on FPGA. Our algorithm produces placements of comparable quality when integrated into a state-of-the-art academic placer. We formulate the spreading problem as a system of fluid flows across reservoirs and mathematically prove that this formulation produces flows without cycles when solved as a continuous-time system. We also propose a flow correction algorithm that makes the flows monotonic, reduces total cell displacement, and removes cycles that may arise during the discretization process. Our new flow correction algorithm has a better time complexity for cycle removal than previous algorithms for finding cycles in a generic graph. Compared to our previously published linear-programming-based spreading algorithm [1], our new fluid-flow-based multi-threaded spreading algorithm is 3.44x faster, and the corresponding FPGA-accelerated version is 5.15x faster.
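To give a feel for spreading as flow between reservoirs, the toy below diffuses overflow from over-full placement bins toward emptier neighbors in one dimension. It is a loose illustration only: the paper's continuous-time fluid formulation, acyclicity proof, and flow correction pass are not reproduced here.

```python
def spread(util, capacity=1.0, iters=200):
    """One bin per reservoir; overflow diffuses toward emptier neighbors."""
    util = list(util)
    for _ in range(iters):
        if max(util) <= capacity + 1e-9:      # every bin fits: done
            break
        for i in range(len(util)):
            if util[i] <= capacity:
                continue                       # only over-full bins emit flow
            for j in (i - 1, i + 1):           # neighboring reservoirs
                if 0 <= j < len(util) and util[j] < util[i]:
                    # move at most half the overflow, never past equalization
                    flow = min((util[i] - capacity) / 2,
                               (util[i] - util[j]) / 2)
                    util[i] -= flow
                    util[j] += flow
    return [round(u, 3) for u in util]

print(spread([2.5, 0.2, 0.1, 0.0]))  # overflow drains toward the right
```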
{"title":"FPGA-Accelerated Spreading for Global Placement","authors":"Shounak Dhar, L. Singhal, M. Iyer, D. Pan","doi":"10.1109/HPEC.2019.8916251","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916251","url":null,"abstract":"Placement takes a large part of the runtime in an Electronic Design Automation design implementation flow. In modern industrial and academic physical design impementation tools, global placement consumes a significant part of the overall placement runtime. Many of these global placers decouple the placement problem into two main parts - numerical optimization and spreading. In this paper, we propose a new and massively parallel spreading algorithm and also accelerate a part of this algorithm on FPGA. Our algorithm produces placements with comparable quality when integrated into a state-of-the-art academic placer. We formulate the spreading problem as a system of fluid flows across reservoirs and mathematically prove that this formulation produces flows without cycles when solved as a continuous-time system. We also propose a flow correction algorithm to make the flows monotonic, reduce total cell displacement and remove cycles which may arise during the discretization process. Our new flow correction algorithm has a better time complexity for cycle removal than previous algorithms for finding cycles in a generic graph. When compared to our previously published linear programming based spreading algorithm [1], our new fluid-flow based multi-threaded spreading algorithm is 3.44x faster, and the corresponding FPGA-accelerated version is 5.15x faster.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121662223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic Parallelization to Asynchronous Task-Based Runtimes Through a Generic Runtime Layer
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916294
Charles Jin, M. Baskaran, Benoît Meister, J. Springer
With the end of Moore's law, asynchronous task-based parallelism has seen growing support as a parallel programming paradigm, with the runtime system offering such advantages as dynamic load balancing, locality, and scalability. However, there has been a proliferation of such programming systems in recent years, each of which presents different performance tradeoffs and runtime semantics. Developing applications on top of these systems thus requires not only application expertise but also deep familiarity with the runtime, exacerbating the perennial problems of programmability and portability. This work makes three main contributions to this growing landscape. First, we extend a polyhedral optimizing compiler with techniques to extract task-based parallelism and data management for a broad class of asynchronous task-based runtimes. Second, we introduce a generic runtime layer for asynchronous task-based systems with representations of data and tasks that are sparse and tiled by default, which serves as an abstract target for the compiler backend. Finally, we implement this generic layer using OpenMP and Legion, demonstrating the flexibility and viability of the generic layer and delivering an end-to-end path for automatic parallelization to asynchronous task-based runtimes. Using a wide range of applications from deep learning to scientific kernels, we obtain geometric mean speedups of 23.0x (OpenMP) and 9.5x (Legion) using 64 threads.
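To make the "generic runtime layer" idea concrete, here is a toy dependency-driven task layer in Python, with a thread pool standing in for a backend such as OpenMP or Legion. The class and method names are invented for illustration and do not correspond to the paper's API; the point is only that a compiler can emit tasks and dependence edges against one abstract interface and let the backend schedule them asynchronously.

```python
from concurrent.futures import ThreadPoolExecutor

class TaskGraph:
    """Run each task once all of its declared dependencies have completed."""
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(workers)
        self.futures = {}

    def submit(self, name, fn, deps=()):
        def run():
            for d in deps:                    # block until producers finish
                self.futures[d].result()
            return fn()
        self.futures[name] = self.pool.submit(run)
        return self.futures[name]

g = TaskGraph()
g.submit("load", lambda: print("load tile"))
g.submit("compute", lambda: print("compute on tile"), deps=("load",))
g.submit("store", lambda: print("store tile"), deps=("compute",))
g.futures["store"].result()                   # wait for the chain to drain
```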
{"title":"Automatic Parallelization to Asynchronous Task-Based Runtimes Through a Generic Runtime Layer","authors":"Charles Jin, M. Baskaran, Benoît Meister, J. Springer","doi":"10.1109/HPEC.2019.8916294","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916294","url":null,"abstract":"With the end of Moore’s law, asynchronous task-based parallelism has seen growing support as a parallel programming paradigm, with the runtime system offering such advantages as dynamic load balancing, locality, and scalability. However, there has been a proliferation of such programming systems in recent years, each of which presents different performance tradeoffs and runtime semantics. Developing applications on top of these systems thus requires not only application expertise but also deep familiarity with the runtime, exacerbating the perennial problems of programmability and portability.This work makes three main contributions to this growing landscape. First, we extend a polyhedral optimizing compiler with techniques to extract task-based parallelism and data management for a broad class of asynchronous task-based runtimes. Second, we introduce a generic runtime layer for asynchronous task-based systems with representations of data and tasks that are sparse and tiled by default, which serves as an abstract target for the compiler backend. Finally, we implement this generic layer using OpenMP and Legion, demonstrating the flexibility and viability of the generic layer and delivering an end-to-end path for automatic parallelization to asynchronous task-based runtimes. Using a wide range of applications from deep learning to scientific kernels, we obtain geometric mean speedups of 23.0* (OpenMP) and 9.5* (Legion) using 64 threads.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121664685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Garbled Circuits in the Cloud using FPGA Enabled Nodes
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916407
Kai Huang, Mehmet Güngör, Xin Fang, Stratis Ioannidis, M. Leeser
Data privacy is an increasing concern in our interconnected world. Garbled circuits are an important approach to Secure Function Evaluation (SFE); however, they suffer from long garbling times. In this paper we present garbled circuits in the cloud using Amazon Web Services, in particular Amazon F1 FPGA-enabled nodes. We implement both the garbler and the evaluator in software and show how F1 instances can accelerate the garbling process and rapidly adapt to several different applications. Experimental results, measured on AWS, indicate a 15x speedup for garbling done using an FPGA. This yields a total application speedup, including garbling, communication, and evaluation, of close to 3x over a large range of application sizes.
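As background on what "garbling" means, the toy below garbles a single AND gate: the garbler encrypts each output-wire label under the pair of input-wire labels that selects it, and the evaluator, holding exactly one label per input wire, can decrypt exactly one row. This sketch is illustrative only; real systems, including the paper's, use optimized constructions such as point-and-permute, free-XOR, and half-gates.

```python
import hashlib, os, random

def H(*keys):
    """Hash a pair of wire labels into a one-time pad for a table row."""
    return hashlib.sha256(b"".join(keys)).digest()[:16]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# one random 128-bit label per wire (inputs a, b; output c) and per bit value
labels = {w: {v: os.urandom(16) for v in (0, 1)} for w in ("a", "b", "c")}

# garbler: encrypt each output label under the matching pair of input labels
table = [xor(H(labels["a"][va], labels["b"][vb]), labels["c"][va & vb])
         for va in (0, 1) for vb in (0, 1)]
random.shuffle(table)  # hide which row encodes which input combination

# evaluator: holds one label per input wire (here a=1, b=1) and recovers
# only the corresponding output label, learning nothing about other rows
ka, kb = labels["a"][1], labels["b"][1]
for row in table:
    candidate = xor(row, H(ka, kb))
    if candidate in labels["c"].values():   # real schemes use point-and-permute
        print("recovered AND(1, 1) =", candidate == labels["c"][1])
```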
Accelerating DNN Inference with GraphBLAS and the GPU
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916498
Xiaoyun Wang, Zhongyi Lin, Carl Yang, John Douglas Owens
This work addresses the 2019 Sparse Deep Neural Network Graph Challenge with an implementation using the GraphBLAS programming model. We demonstrate our solution with GraphBLAST, a GraphBLAS implementation on the GPU, and compare it to SuiteSparse, a GraphBLAS implementation on the CPU. The GraphBLAST implementation is 1.94x faster than SuiteSparse; the primary opportunity for further GPU performance is a higher-performance sparse-matrix-times-sparse-matrix (SpGEMM) kernel.
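The challenge's inference loop is a chain of sparse-matrix products: each layer computes Y = ReLU(Y*W + b) with both the feature matrix Y and the weight matrix W sparse, which is why SpGEMM dominates the runtime. A scipy sketch of that loop follows, standing in for the GraphBLAS kernels the paper benchmarks; the sizes, density, and bias value are made-up illustrative choices.

```python
import numpy as np
import scipy.sparse as sp

def random_sparse(m, n, seed, density=0.1):
    return sp.random(m, n, density=density, format="csr", random_state=seed)

Y = random_sparse(8, 16, seed=0)                          # sparse feature batch
layers = [random_sparse(16, 16, seed=i + 1) for i in range(3)]  # sparse weights
bias = -0.1                                               # constant per-layer bias

for W in layers:
    Z = Y @ W                          # SpGEMM: the dominant kernel
    Z.data += bias                     # bias applied to the stored entries
    Z.data = np.maximum(Z.data, 0.0)   # ReLU
    Z.eliminate_zeros()                # keep the representation sparse
    Y = Z

print("nonzeros after 3 layers:", Y.nnz)
```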
{"title":"Accelerating DNN Inference with GraphBLAS and the GPU","authors":"Xiaoyun Wang, Zhongyi Lin, Carl Yang, John Douglas Owens","doi":"10.1109/HPEC.2019.8916498","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916498","url":null,"abstract":"This work addresses the 2019 Sparse Deep Neural Network Graph Challenge with an implementation of this challenge using the GraphBLAS programming model. We demonstrate our solution to this challenge with GraphBLAST, a GraphBLAS implementation on the GPU, and compare it to SuiteSparse, a GraphBLAS implementation on the CPU. The GraphBLAST implementation is $1.94 times $ faster than Suite-Sparse; the primary opportunity to increase performance on the GPU is a higher-performance sparse-matrix-times-sparse-matrix (SpGEMM) kernel.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121441934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous Cache Hierarchy Management for Integrated CPU-GPU Architecture
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916239
Hao Wen, W. Zhang
Unlike traditional CPU-GPU heterogeneous architectures, in which the CPU and GPU have separate DRAM and memory address spaces, current heterogeneous CPU-GPU architectures integrate the CPU and GPU on the same die, sharing the same last-level cache (LLC) and memory. In a two-level cache hierarchy where the CPU and GPU have their own private L1 caches but share the LLC, conflict misses in the LLC between CPU and GPU requests may degrade both CPU and GPU performance. In addition, how the CPU and GPU memory request flows (the write-back flow from L1 and the cache-fill flow from main memory) are managed may impact performance. In this work, we study three cache request flow management policies. The first is selective GPU LLC fill, which selectively fills GPU requests in the LLC. The second is selective GPU L1 write back, which selectively writes GPU blocks in the L1 cache back to the L2 cache. The third is a hybrid policy that combines the first two and selectively replaces CPU blocks in the LLC. Our experimental results indicate that the hybrid policy is the best of the three: on average, it improves CPU performance by about 10%, with a maximum CPU improvement of 22% and an average GPU performance overhead of 0.8%.
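A minimal model of the first policy, selective GPU LLC fill: on a GPU miss, the line is installed in the shared LLC only when selected, so streaming GPU traffic cannot evict the CPU's working set. The single LRU set and the all-or-nothing selection knob below are simplifying assumptions for illustration, not the paper's actual selection heuristic.

```python
from collections import OrderedDict

class SharedLLC:
    """A single LRU set shared by CPU and GPU requests."""
    def __init__(self, capacity=4):
        self.cache = OrderedDict()           # address -> owner ("cpu" or "gpu")
        self.capacity = capacity

    def access(self, addr, owner, fill_gpu=False):
        if addr in self.cache:               # hit: refresh LRU position
            self.cache.move_to_end(addr)
            return True
        # miss: GPU requests fill the LLC only when selected by the policy
        if owner == "gpu" and not fill_gpu:
            return False                     # bypass: serve from memory, no fill
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # evict the LRU line
        self.cache[addr] = owner
        return False

llc = SharedLLC()
for a in range(4):
    llc.access(a, "cpu")       # the CPU working set fills the LLC
llc.access(100, "gpu")         # GPU miss bypasses the fill: CPU lines survive
print(list(llc.cache))         # [0, 1, 2, 3]
```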
{"title":"Heterogeneous Cache Hierarchy Management for Integrated CPU-GPU Architecture","authors":"Hao Wen, W. Zhang","doi":"10.1109/HPEC.2019.8916239","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916239","url":null,"abstract":"Unlike the traditional CPU-GPU heterogeneous architecture where CPU and GPU have separate DRAM and memory address space, current heterogeneous CPU-GPU architectures integrate CPU and GPU in the same die and share the same last level cache (LLC) and memory. For the two-level cache hierarchy in which CPU and GPU have their own private L1 caches but share the LLC, conflict misses in the LLC between CPU and GPU may degrade both CPU and GPU performance. In addition, how the CPU and GPU memory requests flows (write back flow from L1 and cache fill flow from main memory) are managed may impact the performance. In this work, we study three different cache requests flow management policies. The first policy is selective GPU LLC fill, which selectively fills the GPU requests in the LLC. The second policy is selective GPU L1 write back, which selectively writes back GPU blocks in L1 cache to L2 cache. The final policy is a hybrid policy that combines the first two, and selectively replaces CPU blocks in the LLC. Our experimental results indicate that the third policy is the best of these three. On average, it can improve the CPU performance by about 10%, with the highest CPU performance improvement of 22%, with 0.8% averaged GPU performance overhead.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121939566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DistTC: High Performance Distributed Triangle Counting
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916438
Loc Hoang, Vishwesh Jatala, Xuhao Chen, U. Agarwal, Roshan Dathathri, G. Gill, K. Pingali
We describe a novel multi-machine multi-GPU implementation of triangle counting that exploits an application-agnostic graph partitioning strategy eliminating almost all inter-host communication during triangle counting. Experimental results show that this distributed implementation can handle very large graphs such as clueweb12, which has almost one billion vertices and 37 billion edges, and that it is up to 1.6× faster than TriCore, the 2018 Graph Challenge champion.
{"title":"DistTC: High Performance Distributed Triangle Counting","authors":"Loc Hoang, Vishwesh Jatala, Xuhao Chen, U. Agarwal, Roshan Dathathri, G. Gill, K. Pingali","doi":"10.1109/HPEC.2019.8916438","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916438","url":null,"abstract":"We describe a novel multi-machine multi-GPU implementation of triangle counting which exploits a novel application-agnostic graph partitioning strategy that eliminates almost all inter-host communication during triangle counting. Experimental results show that this distributed triangle counting implementation can handle very large graphs such as clueweb12, which has almost one billion vertices and 37 billion edges, and it is up to 1.6× faster than TriCore, the 2018 Graph Challenge champion.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"63 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123460675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}