
Latest publications: 2017 IEEE High Performance Extreme Computing Conference (HPEC)

Investigating TI KeyStone II and quad-core ARM Cortex-A53 architectures for on-board space processing
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091094
B. Schwaller, B. Ramesh, A. George
Future space missions require reliable architectures with higher performance and lower power consumption. Exploring new architectures worthy of undergoing the expensive and time-consuming process of radiation hardening is critical for this endeavor. Two such architectures are the Texas Instruments KeyStone II octal-core processor and the ARM® Cortex®-A53 (ARMv8) quad-core CPU. DSPs have been proven in prior space applications, and the KeyStone II has eight high-performance DSP cores and is under consideration for potential hardening for space. Meanwhile, a radiation-hardened quad-core ARM Cortex-A53 CPU is under development at Boeing under the NASA/AFRL High-Performance Spaceflight Computing initiative. In this paper, we optimize and evaluate the performance of batched 1D-FFTs, 2D-FFTs, and the Complex Ambiguity Function (CAF). We developed a direct memory-access scheme to take advantage of the complex KeyStone architecture for FFTs. Our results for batched 1D-FFTs show that the performance per Watt of KeyStone II is 4.5 times better than the ARM Cortex-A53. For CAF, our results show that the KeyStone II is 1.7 times better.
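The batched 1D-FFT workload benchmarked here can be illustrated with a short sketch (a hypothetical NumPy illustration of the workload shape, not the authors' DMA-optimized KeyStone implementation; the batch and signal sizes are assumptions):

```python
import numpy as np

# Illustrative only: a "batched" 1D-FFT applies an independent FFT to each
# row of a 2D array, which is the workload pattern evaluated in the paper.
def batched_fft_1d(signals: np.ndarray) -> np.ndarray:
    """Run one 1D FFT per row of `signals`."""
    return np.fft.fft(signals, axis=-1)

batch = np.random.rand(64, 1024)   # 64 independent 1024-point signals (hypothetical sizes)
spectra = batched_fft_1d(batch)
print(spectra.shape)
```

On accelerators, the win comes from how the batch is staged through local memory, which is what the authors' direct memory-access scheme addresses.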
Citations: 10
Towards numerical benchmark for half-precision floating point arithmetic
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091031
P. Luszczek, J. Kurzak, I. Yamazaki, J. Dongarra
With the NVIDIA Tegra Jetson X1 and Pascal P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers, also called half-precision arithmetic. In this talk, we will introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating point standards and existing HPC benchmarks. The discussion will focus on performance and numerical stability issues that are important for this kind of benchmarking and how they relate to NVIDIA platforms.
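A minimal sketch of the numerical-stability concern such a benchmark must probe (illustrative only; the vector length and values are hypothetical, not from the talk):

```python
import numpy as np

# Naively accumulating many small FP16 values stalls once the running sum
# grows large enough that the addend falls below half the FP16 spacing.
ones = np.full(4096, 0.0001, dtype=np.float16)
naive = np.float16(0.0)
for x in ones:                          # sequential accumulation in FP16
    naive = np.float16(naive + x)
exact = np.sum(ones.astype(np.float64)) # reference sum in FP64
print(float(naive), exact)
```

The gap between the two results is exactly the kind of effect that distinguishes half precision from the established IEEE single/double formats.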
Citations: 11
Application of convolutional neural networks on Intel® Xeon® processor with integrated FPGA
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091025
Philip Colangelo, Enno Lübbers, Randy Huang, M. Margala, Kevin Nealis
Intel®'s Xeon® processor with integrated FPGA is a new research platform that provides all the capabilities of a Broadwell Xeon processor with the added functionality of an Arria 10 FPGA in the same package. In this paper, we present an implementation on this platform to showcase the abilities and effectiveness of utilizing both hardware architectures to accelerate a convolution-based neural network (CNN). We choose a network topology that uses binary weights and low-precision activation data to take advantage of the customizable fabric provided by the FPGA. Further, compared to standard multiply-accumulate CNNs, binary-weighted networks (BWNs) reduce the amount of computation by eliminating the need for multiplication, resulting in little to no classification accuracy degradation. Coupling Intel's Open Programmable Acceleration Engine (OPAE) with Caffe provides a robust framework that was used as the foundation for our application. Because the convolution primitives take the most computation in our network, we offload the feature and weight data to a customized binary convolution accelerator loaded in the FPGA. Employing the low-latency Quick Path Interconnect (QPI) that bridges the Broadwell Xeon processor and Arria 10 FPGA, we can carry out fine-grained offloads while avoiding bandwidth bottlenecks. An initial proof-of-concept design showcasing this new platform, which utilizes only a portion of the FPGA core logic, exemplifies that by using the Xeon processor and FPGA together we can improve throughput by 2× on some layers and by 1.3× overall.
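The multiplication-free property of binary weights can be sketched in a few lines (an illustrative NumPy example of the BWN idea, not the paper's FPGA accelerator; the vector size and seed are assumptions):

```python
import numpy as np

# With weights constrained to {+1, -1}, a dot product needs only
# additions and subtractions -- no multiplier hardware.
def binary_dot(x, w_sign):
    """w_sign is a {+1,-1} vector; compute the dot product multiply-free."""
    return x[w_sign > 0].sum() - x[w_sign < 0].sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(16)            # activations
w = np.sign(rng.standard_normal(16))   # binarized weights
print(binary_dot(x, w))
```

On an FPGA this lets the fabric pack far more "multiply-accumulate" lanes into the same area, which is what makes offloading the convolution primitives attractive.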
Citations: 13
WCET analysis of the shared data cache in integrated CPU-GPU architectures
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091059
Y. Huangfu, Wei Zhang
By taking advantage of both the CPU and GPU as well as the shared DRAM and cache, the integrated CPU-GPU architecture has the potential to boost performance for a variety of applications, including real-time applications. However, before being applied to hard real-time and safety-critical applications, the time-predictability of the integrated CPU-GPU architecture needs to be studied and improved. In this work, we study the shared-data Last Level Cache (LLC) in the integrated CPU-GPU architecture and propose an access-interval-based method to improve the time-predictability of the LLC. The results show that the proposed technique can effectively improve the accuracy of the miss-rate estimation in the LLC. We also find that the improved LLC miss-rate estimations can be used to further improve the WCET estimations of GPU kernels running on such an architecture.
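The quantity being bounded, a cache miss rate over an access trace, can be illustrated with a toy simulator (a hedged sketch with hypothetical cache parameters, not the paper's access-interval WCET analysis, which bounds this statically without executing the trace):

```python
# Toy direct-mapped cache: count misses over an address trace.
def miss_rate(trace, num_sets=4, line_bytes=16):
    tags = [None] * num_sets
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        s, tag = line % num_sets, line // num_sets
        if tags[s] != tag:      # tag mismatch -> miss, fill the set
            misses += 1
            tags[s] = tag
    return misses / len(trace)

# Scan 128 bytes twice with a 64-byte cache: the second pass misses too,
# because the working set exceeds capacity.
trace = list(range(0, 128, 4)) * 2
print(miss_rate(trace))
```

A WCET analysis must produce a safe upper bound on this number for every feasible trace; the tighter the bound, the less pessimistic the resulting WCET estimate.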
Citations: 1
Power-aware computing: Measurement, control, and performance analysis for Intel Xeon Phi
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091085
A. Haidar, Heike Jagode, A. YarKhan, Phil Vaccaro, S. Tomov, J. Dongarra
The emergence of power efficiency as a primary constraint in processor and system designs poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, in particular for peta- and exa-scale systems. Understanding and improving the energy efficiency of numerical simulation thus becomes crucial. We present a detailed study toward controlling power usage, exploring how different power caps affect the performance of numerical algorithms with different computational intensities, and determining the impact on and correlation with the performance of scientific applications. Our analysis is performed using a set of representative kernels as well as many widely used scientific benchmarks. We quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap toward achieving energy-efficient computing algorithms.
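The trade-off this kind of study measures can be sketched with invented numbers (all figures below are hypothetical, purely to illustrate why performance per watt can peak below the maximum power cap; they are not the paper's measurements):

```python
# Hypothetical cap/performance pairs: throughput saturates at high caps,
# so efficiency (GFLOP/s per watt) peaks at an intermediate cap.
caps_watts = [120, 160, 200, 240]
gflops     = [600, 850, 1000, 1050]

best = max(zip(caps_watts, gflops), key=lambda p: p[1] / p[0])
for cap, perf in zip(caps_watts, gflops):
    print(f"{cap:>3} W cap: {perf / cap:.2f} GFLOP/s per watt")
print("best efficiency at", best[0], "W")
```

Memory-bound kernels typically saturate earlier than compute-bound ones, which is why the paper varies computational intensity across its kernel set.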
Citations: 16
Fast linear algebra-based triangle counting with KokkosKernels
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091043
Michael M. Wolf, Mehmet Deveci, Jonathan W. Berry, S. Hammond, S. Rajamanickam
Triangle counting serves as a key building block for a set of important graph algorithms in network science. In this paper, we address the IEEE HPEC Static Graph Challenge problem of triangle counting, focusing on obtaining the best parallel performance on a single multicore node. Our implementation uses a linear algebra-based approach to triangle counting that has grown out of work related to our miniTri data analytics miniapplication [1] and our efforts to pose graph algorithms in the language of linear algebra. We leverage KokkosKernels to implement this approach efficiently on multicore architectures. Our performance results are competitive with the fastest known graph traversal-based approaches and are significantly faster than the Graph Challenge reference implementations, up to 670,000 times faster than the C++ reference and 10,000 times faster than the Python reference on a single Intel Haswell node.
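The linear-algebra formulation of triangle counting can be sketched in a few lines (a dense NumPy illustration of the standard identity, not the sparse KokkosKernels implementation; the example graph is hypothetical):

```python
import numpy as np

# For a symmetric 0/1 adjacency matrix A, (A @ A)[i, j] counts length-2
# paths i -> j; masking by A keeps only those closed by an edge, and each
# triangle is counted 6 times (3 vertices x 2 directions).
def count_triangles(A):
    return int((A @ A * A).sum() // 6)

# 4-cycle 0-1-2-3-0 plus chord 0-2: triangles {0,1,2} and {0,2,3}.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]])
print(count_triangles(A))
```

Production implementations work on sparse lower-triangular factors of A to avoid the 6x overcounting and redundant work, but the algebraic structure is the same.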
Citations: 61
Triangle counting for scale-free graphs at scale in distributed memory
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091051
R. Pearce
Triangle counting has long been a challenge problem for sparse graphs containing high-degree "hub" vertices that exist in many real-world scenarios. These high-degree vertices create a quadratic number of wedges, or 2-edge paths, which for brute force algorithms require closure checking or wedge checks. Our work-in-progress builds on existing heuristics for pruning the number of wedge checks by ordering based on degree and other simple metrics. Such heuristics can dramatically reduce the number of required wedge checks for exact triangle counting for both real and synthetic scale-free graphs. Our triangle counting algorithm is implemented using HavoqGT, an asynchronous vertex-centric graph analytics framework for distributed memory. We present a brief experimental evaluation on two large real scale-free graphs: a 128B edge web-graph and a 1.4B edge twitter follower graph, and a weak scaling study on synthetic Graph500 RMAT graphs up to 274.9 billion edges.
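Degree-ordered wedge pruning can be sketched as follows (an illustrative single-node Python version of the general heuristic, not the distributed HavoqGT implementation; the example edge list is hypothetical):

```python
from collections import defaultdict

# Orienting every edge from lower- to higher-degree endpoint bounds the
# out-degree of hub vertices, pruning the quadratic wedge explosion.
def triangles_via_wedges(edges):
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda v: (deg[v], v)          # total order; ties broken by id
    out = defaultdict(set)
    for u, v in edges:                    # orient each edge toward higher rank
        a, b = (u, v) if rank(u) < rank(v) else (v, u)
        out[a].add(b)
    # Each oriented wedge u -> v closed by a common out-neighbor is a triangle.
    return sum(len(out[u] & out.get(v, set())) for u in list(out) for v in out[u])

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
print(triangles_via_wedges(edges))
```

Each triangle is found exactly once under this orientation, and a hub's wedges are charged to its lower-degree neighbors, which is what makes the approach viable on scale-free graphs.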
Citations: 57
Sparse matrix assembly on the GPU through multiplication patterns
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091057
Rhaleb Zayer, M. Steinberger, H. Seidel
The numerical treatment of variational problems gives rise to large sparse matrices, which are typically assembled by coalescing elementary contributions. As the explicit matrix form is required by numerical solvers, the assembly step can be a potential bottleneck, especially in implicit and time-dependent settings where considerable updates are needed. On standard HPC platforms, this process can be vectorized by taking advantage of additional mesh-querying data structures. However, on graphics hardware, vectorization is inhibited by limited memory resources. In this paper, we propose a lean unstructured mesh representation, which allows casting the assembly problem as a sparse matrix-matrix multiplication. We demonstrate how the global graph connectivity of the assembled matrix can be captured through basic linear algebra operations and show how local interactions between nodes/degrees of freedom within an element can be encoded by means of a concise representation, action maps. These ideas not only reduce the memory storage requirements but also cut down on the bulk of data that needs to be moved from global storage to the compute units, which is crucial on parallel computing hardware, and in particular on the GPU.
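The idea of recovering the assembled matrix's connectivity through multiplication can be sketched with a toy incidence matrix (illustrative only, using a dense NumPy stand-in; the paper's action maps encode richer per-element interactions than this pattern-only example):

```python
import numpy as np

# M is an element-to-node incidence matrix: one row per element, one
# column per node. The product M^T @ M couples node i with node j exactly
# when some element contains both -- the sparsity pattern of assembly.
M = np.array([[1, 1, 1, 0],    # element 0 touches nodes 0, 1, 2
              [0, 1, 1, 1]])   # element 1 touches nodes 1, 2, 3
pattern = (M.T @ M) > 0
print(pattern.astype(int))
```

Nodes 0 and 3 share no element, so the corresponding entry stays zero; everything else couples. Replacing the 1s with per-element contributions turns the same multiplication into the assembly itself.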
Citations: 9
Ultra-high fidelity radio frequency propagation modeling using distributed high performance graphical processing units: A simulator for multi-element non-stationary antenna systems
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091082
Mark D. Barnell, Nathan Stokes, Jason Steeger, Jessie Grabowski
A newly invented, distributed, high-performance graphical processing framework that simulates complex radio frequency (RF) propagation has been developed and demonstrated. The approach uses an advanced computer architecture and an intensive multi-core system to enable high-performance data analysis at the fidelity necessary to design and develop modern sensor systems. This widely applicable simulation and modeling technology aids in the design and development of state-of-the-art systems with complex waveforms and more advanced downstream exploitation techniques, e.g., systems with arbitrary RF waveforms, higher RF bandwidths, and increasing resolution. Recent breakthroughs in computing hardware, software, systems, and applications have enabled these concepts to be tested and demonstrated in a large variety of environments, early in the design cycle. Improvements in simulation accuracy and simulation timescales have been made that immediately increase the value to the end user. A near-analytic RF propagation model increased the computational need by orders of magnitude and also increased the required numerical precision. The new general-purpose graphics processing units (GPGPUs) provided the capability to simulate the propagation effects and model them with the necessary information dependence and floating-point mathematics where performance matters.
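As a hedged illustration of the simplest RF propagation calculation (textbook free-space path loss from the Friis relation, not the near-analytic AirWASP model, which handles far richer effects):

```python
import math

# Free-space path loss: FSPL(dB) = 20 * log10(4 * pi * d * f / c),
# for distance d in meters and frequency f in hertz.
def fspl_db(distance_m: float, freq_hz: float) -> float:
    c = 299_792_458.0  # speed of light, m/s
    return 20 * math.log10(4 * math.pi * distance_m * freq_hz / c)

print(round(fspl_db(1_000.0, 2.4e9), 1))   # 1 km link at 2.4 GHz
```

High-fidelity simulators evaluate far more elaborate per-path models like this across millions of element/target pairs, which is why GPU throughput and numerical precision both matter.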
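The MATLAB-to-GPU comparison above turns on the fact that RF propagation is evaluated independently per path, i.e., it is embarrassingly parallel floating-point work. As a rough, hypothetical illustration only (the paper's near-analytic model is far more detailed, and none of these names come from the paper), a batched free-space channel response can be sketched as:

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def free_space_response(distances_m, freq_hz):
    """Complex free-space channel response for a batch of propagation paths.

    Friis amplitude loss lambda/(4*pi*d) combined with the carrier phase
    rotation -2*pi*d/lambda. Every path is independent, so the same kernel
    ports directly to a GPU (e.g., CUDA, or gpuArray in MATLAB).
    """
    wavelength = C / freq_hz
    amplitude = wavelength / (4.0 * np.pi * distances_m)
    phase = -2.0 * np.pi * distances_m / wavelength
    return amplitude * np.exp(1j * phase)

# One million paths, 1 km to 50 km, at a 10 GHz carrier, in one batched call
d = np.linspace(1e3, 5e4, 1_000_000)
h = free_space_response(d, 10e9)
```

Because every element of the batch is independent, the identical expression vectorizes across GPU threads; this is the kind of data parallelism behind the reported reduction from 16.5 days to under 1 day (roughly a 16.5x or greater speedup).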
Citations: 0
A cloud-based brain connectivity analysis tool
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091080
L. Brattain, Mihnea Bulugioiu, Adam Brewster, Mark Hernandez, Heejin Choi, T. Ku, Kwanghun Chung, V. Gadepally
With advances in high-throughput brain imaging at the cellular and sub-cellular level, there is growing demand for platforms that can support high-performance, large-scale brain data processing and analysis. In this paper, we present a novel pipeline that combines Accumulo, D4M, geohashing, and parallel programming to manage large-scale neuron connectivity graphs in a cloud environment. Our brain connectivity graph is represented using vertices (fiber start/end nodes), edges (fiber tracks), and the 3D coordinates of the fiber tracks. For optimal performance, we take the hybrid approach of storing vertices and edges in Accumulo and saving the fiber-track 3D coordinates in flat files. Accumulo database operations offer low latency on sparse queries, while flat files offer high throughput for storing, querying, and analyzing bulk data. We evaluated our pipeline using 250 gigabytes of mouse neuron connectivity data. Benchmarking experiments on retrieving vertices and edges from Accumulo demonstrate a 1–2 order-of-magnitude speedup in retrieval time compared to the same operation on traditional flat files. The implementation of graph analytics such as Breadth-First Search using Accumulo and D4M offers consistently good performance regardless of data size and density, and is thus scalable to very large datasets. Indexing of neuron subvolumes is simple and logical with geohashing-based binary-tree encoding. This hybrid data-management backend drives an interactive web-based 3D graphical user interface, where users can examine the 3D connectivity map in a Google Map-like viewer. Our pipeline is scalable and extensible to other data modalities.
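D4M expresses graph analytics as operations on associative arrays, so one breadth-first-search step becomes a matrix-vector product of the adjacency matrix with the current frontier. A minimal sketch of that pattern, using a dense NumPy array for brevity (the paper's implementation runs against Accumulo-backed tables; these names are illustrative):

```python
import numpy as np

def bfs_levels(A, source):
    """Hop distance from `source` to every vertex of adjacency matrix A.

    Each iteration expands the frontier with one matrix-vector product,
    the same pattern D4M expresses over Accumulo-backed associative
    arrays. Unreachable vertices keep level -1.
    """
    n = A.shape[0]
    level = np.full(n, -1)
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    hop = 0
    while frontier.any():
        level[frontier] = hop
        # vertices adjacent to the frontier that have not been visited yet
        frontier = ((A.T @ frontier.astype(int)) > 0) & (level == -1)
        hop += 1
    return level

# tiny example: chain 0 -> 1 -> 2, vertex 3 isolated
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
levels = bfs_levels(A, 0)  # [0, 1, 2, -1]
```

The per-step cost is one sparse matrix-vector product, so the same loop scales from this toy example to Accumulo-scale graphs by swapping in a sparse or database-backed matrix.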
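The geohashing-based binary-tree encoding mentioned in the abstract can be pictured as interleaving bits from successive binary subdivisions of each axis, so nearby points share key prefixes; that is what makes subvolume range scans over sorted Accumulo row keys efficient. A hypothetical sketch of such an encoding (not the paper's actual scheme):

```python
def geohash3d(x, y, z, bits_per_axis=8):
    """Interleaved binary key for a point in the unit cube [0, 1)^3.

    Each level of the implied binary tree halves the bounding box along
    one axis; interleaving the x/y/z bits yields keys whose shared prefix
    length grows as points get spatially closer.
    """
    coords = [x, y, z]
    lo = [0.0, 0.0, 0.0]
    hi = [1.0, 1.0, 1.0]
    key = []
    for _ in range(bits_per_axis):
        for axis in range(3):
            mid = (lo[axis] + hi[axis]) / 2.0
            if coords[axis] >= mid:
                key.append("1")
                lo[axis] = mid
            else:
                key.append("0")
                hi[axis] = mid
    return "".join(key)

key = geohash3d(0.1, 0.2, 0.3)  # nearby points share a prefix with this key
```

Using such a key (or a base-32 rendering of it) as an Accumulo row-key prefix turns "all fibers in this subvolume" into a single contiguous range scan.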
Citations: 1
Journal
2017 IEEE High Performance Extreme Computing Conference (HPEC)