
Latest publications: 2019 IEEE High Performance Extreme Computing Conference (HPEC)

Low Power Computing and Simultaneous Electro-Optical/Radar Data Processing using IBM’s NS16e 16-chip Neuromorphic Hardware
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916311
Mark D. Barnell, Courtney Raymond, Daniel Brown, Matthew Wilson, Éric Côté
For the first time, advanced machine learning (ML) compute architectures, techniques, and methods were demonstrated simultaneously on United States Geological Survey (USGS) optical imagery and Department of Defense (DoD) Synthetic Aperture Radar (SAR) imagery, using IBM’s new NS16e neurosynaptic processor board, which comprises 16 TrueNorth chips. The Air Force Research Laboratory (AFRL) Information Directorate Advanced Computing and Communications Division continues to develop and demonstrate new bio-inspired computing algorithms and architectures designed to provide advanced, ultra-low-power, ground and airborne High-Performance Computing (HPC) solutions that meet operational and tactical real-time processing needs for Intelligence, Surveillance, and Reconnaissance (ISR) missions on small-form-factor hardware and in Size, Weight and Power (SWaP)-constrained environments. With an average throughput of 16,000 inferences per second, the system provided a processing efficiency of 1,066 inferences per Watt. The NS16e power utilization never exceeded 15 Watts for this application, and the power consumption attributable to the TrueNorth processors was bounded below 5.5 Watts.
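The throughput and power figures reported in the abstract are mutually consistent; a one-line check (not part of the paper):

```python
# Consistency check of the reported NS16e figures: 16,000 inferences/s
# within a 15 W board budget yields the stated ~1,066 inferences per Watt.
throughput_ips = 16_000   # average inferences per second
board_power_w = 15        # worst-case NS16e board power draw in Watts
print(throughput_ips // board_power_w)  # 1066
```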
Citations: 0
Combining Tensor Decompositions and Graph Analytics to Provide Cyber Situational Awareness at HPC Scale
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916559
J. Ezick, Ben Parsons, W. Glodek, Thomas Henretty, M. Baskaran, R. Lethin, J. Feo, Tai-Ching Tuan, Christopher J. Coley, Leslie Leonard, R. Agrawal
This paper describes MADHAT (Multidimensional Anomaly Detection fusing HPC, Analytics, and Tensors), an integrated workflow that demonstrates the applicability of HPC resources to the problem of maintaining cyber situational awareness. MADHAT combines two high-performance packages: ENSIGN for large-scale sparse tensor decompositions and HAGGLE for graph analytics. Tensor decompositions isolate coherent patterns of network behavior in ways that common clustering methods based on distance metrics cannot. Parallelized graph analysis then uses directed queries on a representation that combines the elements of identified patterns with other available information (such as additional log fields, domain knowledge, network topology, whitelists and blacklists, prior feedback, and published alerts) to confirm or reject a threat hypothesis, collect context, and raise alerts. MADHAT was developed using the collaborative HPC Architecture for Cyber Situational Awareness (HACSAW) research environment and evaluated on structured network sensor logs collected from Defense Research and Engineering Network (DREN) sites using HPC resources at the U.S. Army Engineer Research and Development Center DoD Supercomputing Resource Center (ERDC DSRC). To date, MADHAT has analyzed logs with over 650 million entries.
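The kind of sparse count tensor that ENSIGN decomposes can be assembled in coordinate (COO) form directly from structured log fields. A minimal sketch with hypothetical log records (this is not MADHAT code, and ENSIGN's actual input format is not described in the abstract):

```python
from collections import Counter

# Hypothetical structured log records: (source IP, destination IP, dest port).
# These field names and values are illustrative, not DREN data.
logs = [
    ("10.0.0.1", "10.0.0.9", 443),
    ("10.0.0.1", "10.0.0.9", 443),
    ("10.0.0.2", "10.0.0.9", 22),
    ("10.0.0.3", "10.0.0.7", 443),
]

# Coordinate form of a sparse 3-way count tensor: one nonzero per distinct
# (src, dst, port) triple, valued by its occurrence count.
tensor = Counter(logs)
print(tensor[("10.0.0.1", "10.0.0.9", 443)])  # 2
print(len(tensor))                            # 3 nonzeros
```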
Citations: 14
Prototype Container-Based Platform for Extreme Quantum Computing Algorithm Development
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916430
P. Dreher, Madhuvanti Ramasami
Recent advances in the development of the first generation of quantum computing devices have provided researchers with computational platforms on which to explore new ideas and reformulate conventional computational codes for a quantum computer. Developers can now implement these reformulations on both quantum simulators and hardware platforms through a cloud computing software environment. For example, the IBM Q Experience provides direct access to IBM's quantum simulators and quantum computing hardware platforms. However, these access options may not be an optimal environment for developers who need to download and modify source code and libraries. This paper focuses on the construction of a Docker container environment, with Qiskit source code and libraries, running on a local cloud computing system that can directly access the IBM Q Experience. This prototype container-based system allows single users and small project groups to do rapid prototype development, testing, and implementation of extreme-capability algorithms with more agility and flexibility than the IBM Q Experience website provides. The prototype environment also provides an excellent teaching environment for labs and project assignments within graduate courses in cloud computing and quantum computing. The paper also discusses the computer security challenges of expanding this prototype container system to larger groups of quantum computing researchers.
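A container of this kind is typically launched with a bind mount so that the Qiskit sources and libraries remain editable on the host; a minimal sketch, with the image name and mount paths purely hypothetical (not taken from the paper):

```python
import shlex

# Hypothetical names: a locally built image containing Qiskit, with the
# host's Qiskit source tree bind-mounted into the container for editing.
image = "local/qiskit-dev:latest"
src_mount = "/home/user/qiskit:/opt/qiskit"

cmd = ["docker", "run", "--rm", "-it", "-v", src_mount, image, "bash"]
print(shlex.join(cmd))
```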
Citations: 6
Update on Triangle Counting on GPU
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916547
Carl Pearson, M. Almasri, Omer Anjum, Vikram Sharma Mailthody, Zaid Qureshi, R. Nagi, Jinjun Xiong, Wen-mei W. Hwu
This work presents an update to the triangle-counting portion of the subgraph isomorphism static graph challenge. This work is motivated by a desire to understand the impact of CUDA unified memory on the triangle-counting problem. First, CUDA unified memory is used to overlap reading large graph data from disk with graph data structures in GPU memory. Second, we use CUDA unified memory hints to solve multi-GPU performance scaling challenges present in our last submission. Finally, we improve the single-GPU kernel performance from our past submission by introducing a work-stealing dynamic algorithm GPU kernel with persistent threads, which makes performance adaptive for large graphs without requiring a graph analysis phase.
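For reference, the underlying kernel is a per-edge neighbor-set intersection; a serial Python analogue of what the GPU parallelizes (a sketch, not the authors' CUDA code):

```python
# Reference triangle count on a small undirected graph, using the per-edge
# neighbor-set intersection that GPU triangle-counting kernels parallelize.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Each triangle {u,v,w} contributes one common neighbor to each of its
# three edges, so summing intersections over edges counts it three times.
count = sum(len(adj[u] & adj[v]) for u, v in edges) // 3
print(count)  # 2 triangles: {0,1,2} and {1,2,3}
```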
Citations: 13
A GPU Implementation of the Sparse Deep Neural Network Graph Challenge
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916223
M. Bisson, M. Fatica
This paper presents a CUDA implementation of the latest addition to the Graph Challenge: inference computation on a collection of large sparse deep neural networks. A single Tesla V100 can compute the inference at 3.7 TeraEdges/s. The managed memory API available in CUDA allows for simple and efficient distribution of these computations across a multi-GPU NVIDIA DGX-2 server.
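The Sparse DNN Graph Challenge inference is, per layer, Y' = ReLU(Y W + b) with a sparse weight matrix and a uniform bias; a toy serial analogue (not the paper's CUDA implementation, and with made-up weights):

```python
# One layer of sparse DNN inference: out = ReLU(y @ W + bias), with W
# stored as a sparse edge list. Toy sizes; the challenge networks have
# thousands of layers and millions of nonzeros per layer.
def layer(y, w_edges, bias):
    out = [bias] * len(y)          # uniform bias across neurons
    for i, j, wij in w_edges:      # sparse weights: (row, col, value)
        out[j] += y[i] * wij
    return [max(0.0, v) for v in out]  # ReLU

y0 = [1.0, 2.0, 0.0]
w = [(0, 1, 0.5), (1, 0, -1.0), (1, 2, 2.0)]
print(layer(y0, w, bias=0.0))  # [0.0, 0.5, 4.0]
```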
Citations: 16
Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916398
A. Cabrera, R. Chamberlain, J. Beard
The problem of efficiently feeding processing elements and finding ways to reduce data movement is pervasive in computing. Efficient modeling of both temporal and spatial locality of memory references is invaluable in identifying superfluous data movement in a given application. To this end, we present a new way to infer both spatial and temporal locality using reuse distance analysis. This is accomplished by performing reuse distance analysis at different data block granularities: specifically, 64B, 4KiB, and 2MiB sizes. This process of simultaneously observing reuse distance with multiple granularities is called multi-spectral reuse distance. This approach allows for a qualitative analysis of spatial locality, through observing the shifting of mass in an application’s reuse signature at different granularities. Furthermore, the shift of mass is empirically measured by calculating the Earth Mover’s Distance between reuse signatures of an application. From the characterization, it is possible to determine how spatially dense the memory references of an application are based on the degree to which the mass has shifted (or not shifted) and how close (or far) the Earth Mover’s Distance is to zero as the data block granularity is increased. It is also possible to determine an appropriate page size from this information, and whether or not a given page is being fully utilized. From the applications profiled, it is observed that not all applications will benefit from having a larger page size. Additionally, larger data block granularities subsuming smaller ones suggest that larger pages will allow for more spatial locality exploitation, but examining the memory footprint will show whether those larger pages are fully utilized or not.
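Reuse distance (the number of distinct blocks touched between successive accesses to the same block) is easy to compute naively. The sketch below, a quadratic-time illustration rather than an efficient profiler, shows how the same four-access trace looks at 64 B versus 4 KiB granularity: the reuse distance collapses at the coarser block size, signalling spatially dense references.

```python
# Reuse distance of each access = number of distinct blocks touched since
# the previous access to the same block (None on a cold first touch).
def reuse_distances(addrs, block_size):
    last_pos = {}                        # block -> index of previous access
    trace = [a // block_size for a in addrs]
    dists = []
    for i, b in enumerate(trace):
        if b in last_pos:
            dists.append(len(set(trace[last_pos[b] + 1 : i])))
        else:
            dists.append(None)           # cold miss
        last_pos[b] = i
    return dists

addrs = [0, 64, 128, 0]                  # three adjacent 64 B lines, then a reuse
print(reuse_distances(addrs, 64))        # [None, None, None, 2]
print(reuse_distances(addrs, 4096))      # [None, 0, 0, 0]
```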
Citations: 2
IP Cores for Graph Kernels on FPGAs
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916363
S. Kuppannagari, Rachit Rajat, R. Kannan, A. Dasu, V. Prasanna
Graphs are a powerful abstraction for representing networked data in many real-world applications. The need for performing large scale graph analytics has led to widespread adoption of dedicated hardware accelerators such as FPGA for this purpose. In this work, we develop IP cores for several key graph kernels. Our IP cores use graph processing over partitions (GPOP) programming paradigm to perform computations over graph partitions. Partitioning the input graph into nonoverlapping partitions improves on-chip data reuse. Additional optimizations to exploit intra and interpartition parallelism and to reduce external memory accesses are also discussed. We generate FPGA designs for general graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix Vector Multiplication (SpMV), PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Component (WCC). We target a platform consisting of large external DDR4 memory to store the graph data and Intel Stratix FPGA to accelerate the processing. Experimental results show that our accelerators sustain a high throughput of up to 2250, 2300, 3378, and 2178 Million Traversed Edges Per Second (MTEPS) for SpMV, PR, SSSP and WCC, respectively. Compared with several highly-optimized multi-core designs, our FPGA framework achieves up to 20.5× speedup for SpMV, 16.4× speedup for PR, 3.5× speedup for SSSP, and 35.1× speedup for WCC, and compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3× speedup for SpMV, 1.64× speedup for PR, and 1.8× speedup for WCC, respectively. We develop a performance model for our GPOP paradigm. We then perform performance predictions of our designs assuming the graph is stored in HBM2 instead of DRAM. We further discuss extensions to our optimizations to improve the throughput.
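The GPOP idea of processing edges partition-by-partition over non-overlapping destination ranges can be shown with a much-simplified serial SpMV analogue (hypothetical toy data; the paper's IP cores realize this in FPGA hardware, not software):

```python
# SpMV computed partition-by-partition: destination vertices are split into
# non-overlapping ranges, and each partition's edges are processed together
# so that writes stay within one contiguous slice of y (better data reuse).
def spmv_partitioned(n, edges, x, part_size):
    y = [0.0] * n
    parts = {}                                   # partition id -> its edges
    for src, dst, val in edges:
        parts.setdefault(dst // part_size, []).append((src, dst, val))
    for pid in sorted(parts):                    # one partition at a time
        for src, dst, val in parts[pid]:
            y[dst] += val * x[src]
    return y

edges = [(0, 0, 2.0), (1, 0, 1.0), (0, 3, 4.0)]  # (src, dst, weight)
x = [1.0, 3.0, 0.0, 0.0]
print(spmv_partitioned(4, edges, x, part_size=2))  # [5.0, 0.0, 0.0, 4.0]
```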
Citations: 2
BLAST: Blockchain-based Trust Management in Smart Cities and Connected Vehicles Setup
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916229
Farah I. Kandah, Brennan Huber, Amani Altarawneh, Sai Medury, A. Skjellum
Advancement in communication technologies and the Internet of Things (IoT) is driving the adoption of smart cities, which aim to increase the operational efficiency of infrastructure and improve the quality of services and citizen welfare, among other worthy goals. For instance, it is estimated that by 2020, 75% of cars shipped globally will be equipped with hardware to facilitate vehicle connectivity. The privacy, reliability, and integrity of communication must be ensured so that actions can be accurate and implemented promptly after actionable information is received. Because vehicles are equipped with the ability to compute, communicate, and sense their environment, there is a concomitant critical need to create and maintain trust among network entities despite the network's dynamism: trust between entities must be built and validated in the short time before they leave each other's range. In this work, we present a multi-tier scheme, consisting of an authentication- and trust-building/distribution framework designed with blockchain technology, to ensure the safety and validity of the information exchanged in the system. Through simulation, we illustrate the tradeoff between blockchain mining time and the number of blocks generated, as well as the effect of vehicle speed on the number of blocks generated.
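The mining-time tradeoff such simulations explore stems from proof-of-work difficulty. A toy illustration (the abstract does not specify BLAST's consensus parameters, so the sketch below is a generic SHA-256 proof-of-work, not the paper's scheme):

```python
import hashlib

# Toy proof-of-work: find a nonce whose SHA-256 digest of (payload+nonce)
# starts with `difficulty` zero hex digits. Expected mining time grows
# roughly 16x per extra digit, which drives the tradeoff between mining
# time and block generation rate.
def mine(payload: bytes, difficulty: int) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(payload + str(nonce).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

nonce = mine(b"vehicle-trust-record", 2)
digest = hashlib.sha256(b"vehicle-trust-record" + str(nonce).encode()).hexdigest()
print(digest[:2])  # "00" by construction
```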
Citations: 10
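The mining-time tradeoff the abstract studies by simulation can be illustrated with a toy hash-chained ledger of trust records. This is a minimal sketch, not the paper's BLAST protocol: the block layout, the vehicle trust scores, and the proof-of-work difficulty knob are all invented for the example; raising `difficulty` lengthens mining time and so reduces how many blocks can be generated while two vehicles remain in range.

```python
# Toy hash-chained ledger of trust records (illustrative only; not BLAST).
import hashlib
import json

def make_block(prev_hash, trust_records, difficulty=2):
    """Mine one block of (vehicle_id, trust_score) records.

    Higher difficulty requires more leading zero hex digits in the hash,
    which increases mining time per block.
    """
    nonce = 0
    while True:
        body = json.dumps({"prev": prev_hash, "records": trust_records,
                           "nonce": nonce}, sort_keys=True)
        digest = hashlib.sha256(body.encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return {"hash": digest, "prev": prev_hash,
                    "records": trust_records, "nonce": nonce}
        nonce += 1

# Build a two-block chain; each block commits to its predecessor's hash.
genesis = make_block("0" * 64, [("vehicle-A", 0.9)])
block2 = make_block(genesis["hash"], [("vehicle-B", 0.7)])
assert block2["prev"] == genesis["hash"]  # chain integrity holds
```

Tampering with a record in `genesis` would change its hash and break the `prev` link in `block2`, which is the integrity property the framework relies on.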
Optimizing the Visualization Pipeline of a 3-D Monitoring and Management System
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916493
Rebecca Wild, M. Hubbell, J. Kepner
Monitoring and managing High Performance Computing (HPC) systems and environments generates an ever-growing amount of data. Making sense of this data, and building a platform where system administrators and management can visualize it to proactively identify system failures or understand the state of the system, requires the platform to be as efficient and scalable as the underlying database tools used to store and analyze the data. In this paper we show how we leverage Accumulo, D4M, and Unity to build a 3-D visualization platform for monitoring and managing the Lincoln Laboratory Supercomputer systems, and how we have had to retool our approach to scale with those systems.
{"title":"Optimizing the Visualization Pipeline of a 3-D Monitoring and Management System","authors":"Rebecca Wild, M. Hubbell, J. Kepner","doi":"10.1109/HPEC.2019.8916493","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916493","url":null,"abstract":"Monitoring and managing High Performance Computing (HPC) systems and environments generate an ever growing amount of data. Making sense of this data and generating a platform where the data can be visualized for system administrators and management to proactively identify system failures or understand the state of the system requires the platform to be as efficient and scalable as the underlying database tools used to store and analyze the data. In this paper we will show how we leverage Accumulo, d4m, and Unity to generate a 3D visualization platform to monitor and manage the Lincoln Laboratory Supercomputer systems and how we have had to retool our approach to scale with our systems.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117178153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
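The ingest-side reduction such a pipeline needs can be sketched with a plain-Python stand-in for the D4M-style associative arrays the paper stores in Accumulo; neither Accumulo, the D4M API, nor Unity is used here, and the node names and metrics are made up for illustration. The idea is to aggregate raw samples into one summary row per (node, metric) so the renderer's cost does not grow with sample count.

```python
# Dictionary-of-accumulators stand-in for a D4M-style associative array
# (illustrative only; real pipeline uses Accumulo + D4M + Unity).
from collections import defaultdict

# Raw per-node monitoring samples: (node, metric, value). Made-up data.
samples = [
    ("node-001", "cpu_load", 0.82),
    ("node-001", "cpu_load", 0.78),
    ("node-002", "cpu_load", 0.31),
    ("node-002", "mem_used", 0.55),
]

# Aggregate on ingest: keep a running (sum, count) per (node, metric) key.
totals = defaultdict(lambda: [0.0, 0])
for node, metric, value in samples:
    acc = totals[(node, metric)]
    acc[0] += value
    acc[1] += 1

# The visualization front end pulls one averaged row per key, not every sample.
summary = {key: total / count for key, (total, count) in totals.items()}
assert abs(summary[("node-001", "cpu_load")] - 0.80) < 1e-9
```

Pushing this reduction into the storage layer, rather than averaging inside the renderer, is one way to keep the visualization stage scaling with the number of nodes instead of the number of samples.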
Artificial Neural Network and Accelerator Co-design using Evolutionary Algorithms
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916533
Philip Colangelo, Oren Segal, Alexander Speicher, M. Margala
Multilayer feed-forward Artificial Neural Networks (ANNs) are universal function approximators capable of modeling measurable functions to any desired degree of accuracy. In practice, designing practical, efficient neural network architectures requires significant effort and expertise. Designing efficient architectures that also fit optimally on hardware for acceleration adds yet another degree of complexity. In this paper, we use Evolutionary Cell Aided Design (ECAD), a framework that searches the design spaces of ANN structures and reconfigurable hardware for solutions satisfying a set of constraints and fitness functions. A modular and scalable 2-D systolic-array-based machine learning accelerator, built for an Arria 10 GX 1150 FPGA device using OpenCL, enables results to be tested and deployed in real hardware. Alongside the hardware, a software model of the architecture was developed to speed up the evolutionary process. We present results from the ECAD framework showing the effect that optimizing for different objectives (accuracy, images per second, effective giga-operations per second, and latency) has on both ANN and hardware configurations. Through this work we show that a distinct best-performing solution can exist for each optimization objective. This work lays the foundation for finding machine-learning-based solutions for a wide range of applications with different system constraints.
{"title":"Artificial Neural Network and Accelerator Co-design using Evolutionary Algorithms","authors":"Philip Colangelo, Oren Segal, Alexander Speicher, M. Margala","doi":"10.1109/HPEC.2019.8916533","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916533","url":null,"abstract":"Multilayer feed-forward Artificial Neural Networks (ANNs) are universal function approximators capable of modeling measurable functions to any desired degree of accuracy. In practice, designing practical, efficient neural network architectures requires significant effort and expertise. Further, designing efficient neural network architectures that fit optimally on hardware for the benefit of acceleration adds yet another degree of complexity. In this paper, we use Evolutionary Cell Aided Design (ECAD), a framework capable of searching the design spaces for ANN structures and reconfigurable hardware to find solutions based on a set of constraints and fitness functions. Providing a modular and scalable 2D systolic array based machine learning accelerator design built for an Arria 10 GX 1150 FPGA device using OpenCL enables results to be tested and deployed in real hardware. Along with the hardware, a software model of the architecture was developed to speed up the evolutionary process. We present results from the ECAD framework showing the effect various optimizations including accuracy, images per second, effective giga-operations per second, and latency have on both ANN and hardware configurations. Through this work we show that unique solutions can exist for each optimization resulting in the best performance. 
This work lays the foundation for finding machine learning based solutions for a wide range of applications having different system constraints.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124462108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
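The evolutionary loop at the heart of such a co-design search can be sketched in a few lines. This is an illustration of the general technique, not the ECAD framework: the candidate encoding (network depth paired with a hardware parallelism factor) and both fitness proxies are invented for the example, standing in for the framework's real constraints and fitness functions.

```python
# Minimal evolutionary co-design search (illustrative only; not ECAD).
import random

random.seed(0)  # deterministic run for the example

def fitness(candidate):
    layers, lanes = candidate
    accuracy_proxy = 1.0 - 1.0 / (1 + layers)    # deeper net -> higher proxy
    latency_proxy = layers / lanes               # more lanes -> lower latency
    return accuracy_proxy - 0.05 * latency_proxy # single scalar objective

def mutate(candidate):
    layers, lanes = candidate
    return (max(1, layers + random.choice([-1, 0, 1])),
            max(1, lanes + random.choice([-1, 0, 1])))

# Evolve: keep the fitter half each generation, refill by mutating survivors.
population = [(random.randint(1, 8), random.randint(1, 8)) for _ in range(20)]
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

best = max(population, key=fitness)
assert fitness(best) >= fitness((1, 1))  # search beats the trivial design
```

Swapping the scalar objective for per-objective runs (accuracy, throughput, latency) is how a search like this can surface a distinct best configuration per objective, as the abstract reports.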
Journal: 2019 IEEE High Performance Extreme Computing Conference (HPEC)