
2019 IEEE High Performance Extreme Computing Conference (HPEC) — Latest Publications

Progressive Optimization of Batched LU Factorization on GPUs
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916270
A. Abdelfattah, S. Tomov, J. Dongarra
This paper presents a progressive approach for optimizing batched LU factorization on graphics processing units (GPUs). The paper shows that relying on level-3 BLAS routines for performance does not really pay off, and that it is indeed important to pay attention to the memory-bound part of the algorithm, especially when the problem size is very small. In this context, we develop a size-aware multi-level blocking technique that uses different kernel-fusion granularities according to the problem size. Our experiments, conducted on a Tesla V100 GPU, show that the multi-level blocking technique achieves single/double-precision speedups of up to 3.28×/2.69× over the generic LAPACK-style implementation. It is also up to 8.72×/7.2× faster than the cuBLAS library for single and double precision, respectively. The developed solution is integrated into the open-source MAGMA library.
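To make the size-aware dispatch idea concrete, here is a minimal sketch of choosing the kernel-fusion granularity from the matrix size rather than always taking the blocked level-3 BLAS path. The thresholds and names below are hypothetical illustrations, not MAGMA's actual internals.

```cpp
#include <cstdio>

enum class LuPath { FullyFused, TwoLevelBlocked, LapackStyleBlocked };

// Choose how aggressively to fuse the panel-factorization and update
// kernels for an n-by-n batched LU (illustrative thresholds).
LuPath choose_lu_path(int n) {
    if (n <= 32)  return LuPath::FullyFused;       // one kernel, data in registers/shared memory
    if (n <= 128) return LuPath::TwoLevelBlocked;  // fused panels + small GEMM updates
    return LuPath::LapackStyleBlocked;             // classic blocked, level-3 BLAS dominated
}

int main() {
    const int sizes[] = {8, 64, 512};
    for (int n : sizes)
        std::printf("n = %3d -> path %d\n", n, static_cast<int>(choose_lu_path(n)));
}
```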
Citations: 4
A data-driven framework for uncertainty quantification of a fluidized bed
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916467
V. Kotteda, Anitha Kommu, Vinod Kumar
We carried out a nondeterministic analysis of flow in a fluidized bed. The flow in the fluidized bed is simulated with National Energy Technology Laboratory’s open-source multiphase fluid dynamics suite MFiX, which does not possess tools for uncertainty quantification. Therefore, we developed a C++ wrapper to integrate Dakota, an uncertainty quantification toolkit developed at Sandia National Laboratory, with MFiX. The wrapper exchanges uncertain input parameters and key output parameters between Dakota and MFiX. A data-driven framework is also developed to obtain reliable statistics, as it is not feasible to obtain them with the Dakota-MFiX integration alone. The data generated from Dakota-MFiX simulations, using Latin Hypercube sampling with a sample size of 500, is used to train a machine-learning algorithm. The trained and tested deep neural network algorithm is integrated with Dakota via the wrapper to obtain low-order statistics of the bed height and pressure drop across the bed.
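For reference, a minimal sketch of generic Latin Hypercube sampling as used to drive the Dakota-MFiX runs: each parameter range is split into N equal strata, and an independent random permutation per dimension pairs the strata so every stratum is sampled exactly once. This is the textbook algorithm, not Dakota's implementation.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<std::vector<double>> latin_hypercube(int n_samples, int dims,
                                                 std::mt19937& rng) {
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::vector<std::vector<double>> samples(n_samples, std::vector<double>(dims));
    for (int d = 0; d < dims; ++d) {
        std::vector<int> strata(n_samples);
        std::iota(strata.begin(), strata.end(), 0);
        std::shuffle(strata.begin(), strata.end(), rng);   // random stratum pairing
        for (int i = 0; i < n_samples; ++i)                // jitter inside each stratum
            samples[i][d] = (strata[i] + unit(rng)) / n_samples;
    }
    return samples;   // points in [0,1)^dims; scale to each parameter's range
}

int main() {
    std::mt19937 rng(7);
    auto pts = latin_hypercube(500, 3, rng);   // 500 samples over 3 parameters
    return pts.size() == 500 ? 0 : 1;
}
```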
Citations: 0
Scalable Inference for Sparse Deep Neural Networks using Kokkos Kernels
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916378
J. Ellis, S. Rajamanickam
Over the last decade, hardware advances have made training and inference feasible for very large deep neural networks. Sparsified deep neural networks (DNNs) can greatly reduce memory costs and increase the throughput of standard DNNs, provided the loss of accuracy can be controlled. The IEEE HPEC Sparse Deep Neural Network Graph Challenge serves as a testbed for algorithmic and implementation advances that maximize the computational performance of sparse deep neural networks. We base our sparse DNN implementation, KK-SpDNN, on the sparse linear algebra kernels within the Kokkos Kernels library. Using the sparse matrix-matrix multiplication in Kokkos Kernels allows us to reuse a highly optimized kernel. We focus on reducing the single-node and multi-node runtimes for 12 sparse networks. We test KK-SpDNN on Intel Skylake and Knights Landing architectures and see a 120-500x improvement in single-node performance over the serial reference implementation. We run in data-parallel mode with MPI to further speed up network inference, ultimately obtaining an edge processing rate of 1.16e+12 edges per second on 20 Skylake nodes. This translates to a 13x speedup on 20 nodes compared to our highly optimized multithreaded implementation on a single Skylake node.
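To illustrate the inference step this work accelerates, here is a serial sketch of one sparse layer, Y = ReLU(W·X + b), with the weights in CSR form. It simplifies to sparse-weights-times-dense-activations; the paper instead uses Kokkos Kernels' optimized sparse matrix-matrix multiply, and the CSR layout here is just the standard textbook one.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct CsrMatrix {                 // m x k sparse weight matrix in CSR form
    int m, k;
    std::vector<int> row_ptr, col_idx;
    std::vector<float> vals;
};

// X is k x n dense (row-major); returns Y = ReLU(W*X + bias), m x n.
std::vector<float> sparse_layer(const CsrMatrix& W, const std::vector<float>& X,
                                int n, float bias) {
    std::vector<float> Y(static_cast<size_t>(W.m) * n, 0.0f);
    for (int i = 0; i < W.m; ++i)
        for (int p = W.row_ptr[i]; p < W.row_ptr[i + 1]; ++p)
            for (int j = 0; j < n; ++j)
                Y[i * n + j] += W.vals[p] * X[W.col_idx[p] * n + j];
    for (auto& y : Y) y = std::max(0.0f, y + bias);   // bias + ReLU
    return Y;
}

int main() {
    // 2x3 weight matrix [[1,0,2],[0,3,0]] in CSR; X is a 3x1 column of ones.
    CsrMatrix W{2, 3, {0, 2, 3}, {0, 2, 1}, {1.0f, 2.0f, 3.0f}};
    std::vector<float> X = {1.0f, 1.0f, 1.0f};
    auto Y = sparse_layer(W, X, 1, 0.0f);
    std::printf("Y = %.1f %.1f\n", Y[0], Y[1]);   // prints 3.0 3.0
}
```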
Citations: 17
Exploring the Efficiency of OpenCL Pipe for Hiding Memory Latency on Cloud FPGAs
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916236
Arnab A. Purkayastha, S. Raghavendran, Jhanani Thiagarajan, H. Tabkhi
OpenCL programmability combined with OpenCL High-Level Synthesis (OpenCL-HLS) tools has brought tremendous improvements to the reconfigurable computing field. FPGAs’ inherent pipelined parallelism provides not only faster execution times but also power-efficient solutions when executing massively parallel applications. A major execution bottleneck affecting FPGA performance is the high number of memory stalls exposed to the pipelined data-path, which hinders the benefits of data-path customization. This paper explores the efficiency of the “OpenCL Pipe” to hide memory access latency on cloud FPGAs by decoupling memory access from computation. The pipe semantics are leveraged to split OpenCL kernels into “read”, “compute” and “write back” sub-kernels which work concurrently to overlap the computation of current threads with the memory access of future threads. For evaluation, we use a mix of seven massively parallel high-performance applications from the Rodinia suite v3.1. All our tests are conducted on the Xilinx VU9P FPGA platform of the Amazon cloud-based AWS EC2 F1 instance. On average, we observe a 5.2x speedup with a 2.2x increase in memory bandwidth utilization and about a 2.5x increase in FPGA resource utilization over the baseline synthesis (Xilinx OpenCL-HLS). This work has been funded and supported by the Xilinx University Program (XUP).
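As a rough CPU analogue of the read/compute/write split described above, the sketch below connects three concurrently running stages with bounded queues standing in for OpenCL pipes, so the "read" stage streams data ahead of "compute". This is host-side illustration only, under the stated analogy; it is not the paper's OpenCL-HLS kernel code (compile with -pthread).

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

template <typename T>
class BoundedQueue {               // stands in for an OpenCL pipe
public:
    explicit BoundedQueue(size_t cap) : cap_(cap) {}
    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(std::move(v));
        not_empty_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return v;
    }
private:
    std::queue<T> q_;
    size_t cap_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};

int main() {
    const int n = 1 << 16;
    std::vector<float> in(n, 2.0f), out(n);
    BoundedQueue<float> read_to_compute(64), compute_to_write(64);

    // "read" sub-kernel: streams input ahead of the compute stage.
    std::thread reader([&] {
        for (int i = 0; i < n; ++i) read_to_compute.push(in[i]);
    });
    // "compute" sub-kernel: consumes values as they arrive.
    std::thread computer([&] {
        for (int i = 0; i < n; ++i) compute_to_write.push(3.0f * read_to_compute.pop());
    });
    // "write back" sub-kernel: commits results to memory.
    std::thread writer([&] {
        for (int i = 0; i < n; ++i) out[i] = compute_to_write.pop();
    });
    reader.join(); computer.join(); writer.join();
    std::printf("out[0] = %f\n", out[0]);   // prints 6.000000
}
```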
Citations: 3
Scalable Triangle Counting on Distributed-Memory Systems
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916302
Seher Acer, Abdurrahman Yasar, S. Rajamanickam, Michael M. Wolf, Ümit V. Çatalyürek
Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems of the “Static Graph Challenge”. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk to exploit the benefits of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise partitioning is used within the Cartesian blocks for shared-memory parallelism using the Cilk programming model. Besides exhibiting very good strong scaling behavior on almost all tested graphs, our algorithm achieves the fastest time on the 1.4B-edge real-world Twitter graph: 3.217 seconds on 1,092 cores. In comparison to past distributed-memory parallel winners of the graph challenge, we demonstrate a 2.7× speedup on this Twitter graph. This is also the fastest time reported for parallel triangle counting on the Twitter graph when the graph is not replicated.
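For intuition about the linear-algebra formulation: with L the strictly triangular part of the adjacency matrix, the triangle count equals the sum of (L·L) masked by L, which is exactly a count of common neighbors over one orientation of each edge. The serial sketch below computes that sum on ordered adjacency lists; it is illustration only, not the paper's 2D-partitioned MPI+Cilk implementation.

```cpp
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <vector>

long long count_triangles(const std::vector<std::vector<int>>& adj) {
    int n = static_cast<int>(adj.size());
    // Keep only neighbors with a larger id: one orientation per edge,
    // so each triangle u < v < w is counted exactly once at edge (u, v).
    std::vector<std::vector<int>> up(n);
    for (int u = 0; u < n; ++u)
        for (int v : adj[u])
            if (v > u) up[u].push_back(v);
    for (auto& nb : up) std::sort(nb.begin(), nb.end());

    long long total = 0;
    for (int u = 0; u < n; ++u)
        for (int v : up[u]) {
            // |up[u] ∩ up[v]| is the masked-(L·L) contribution of edge (u, v).
            std::vector<int> common;
            std::set_intersection(up[u].begin(), up[u].end(),
                                  up[v].begin(), up[v].end(),
                                  std::back_inserter(common));
            total += static_cast<long long>(common.size());
        }
    return total;
}

int main() {
    // A 4-clique contains exactly 4 triangles.
    std::vector<std::vector<int>> adj = {{1, 2, 3}, {0, 2, 3}, {0, 1, 3}, {0, 1, 2}};
    std::printf("triangles = %lld\n", count_triangles(adj));   // prints 4
}
```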
Citations: 12
Fast Triangle Counting on GPU
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916216
Chuangyi Gui, Long Zheng, Pengcheng Yao, Xiaofei Liao, Hai Jin
Triangle counting is one of the most basic graph applications, used to solve many real-world problems in a wide variety of domains. Exploiting the massive parallelism of the Graphics Processing Unit (GPU) to accelerate triangle counting has become prevalent. We identify that state-of-the-art GPU-based studies that focus on improving load balancing still inherently exhibit a large number of random accesses that degrade performance. In this paper, we design a prefetching scheme that buffers the neighbor list of the vertex being processed in fast shared memory ahead of time, to avoid the high latency of random global memory accesses. We also adopt a degree-based graph reordering technique and design a simple heuristic to evenly distribute the workload. Compared to the state-of-the-art HPEC Graph Challenge champion of the previous year, we improve the performance of triangle counting by up to a 5.9× speedup, with more than 10^9 TEPS on a single GPU for many large real graphs from the graph challenge datasets.
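The degree-based reordering mentioned above is, in its common form, a relabeling of vertices in non-decreasing degree order before orienting edges, so high-degree hubs end up with short forward adjacency lists and per-thread work evens out. A hedged host-side sketch of that idea (illustration only, not the paper's CUDA code or its exact heuristic):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Relabel vertices by ascending degree, then keep only edges oriented from
// lower to higher new label (the lists later intersected during counting).
std::vector<std::vector<int>> reorder_by_degree(
    const std::vector<std::vector<int>>& adj) {
    int n = static_cast<int>(adj.size());
    std::vector<int> order(n), rank(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return adj[a].size() < adj[b].size();   // low-degree vertices first
    });
    for (int i = 0; i < n; ++i) rank[order[i]] = i;   // rank[v] = v's new label

    std::vector<std::vector<int>> out(n);
    for (int u = 0; u < n; ++u)
        for (int v : adj[u])
            if (rank[u] < rank[v]) out[rank[u]].push_back(rank[v]);
    for (auto& nb : out) std::sort(nb.begin(), nb.end());
    return out;
}

int main() {
    // Star graph: hub 0 gets the highest label and an empty forward list.
    std::vector<std::vector<int>> adj = {{1, 2, 3}, {0}, {0}, {0}};
    auto g = reorder_by_degree(adj);
    return g.back().empty() ? 0 : 1;
}
```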
Citations: 0
Lossless Compression of Internal Files in Parallel Reservoir Simulation
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916298
M. Rogowski, Suha N. Kayum, F. Mannuß
In parallel reservoir simulation, very large files are written repeatedly throughout a simulation run. A method is developed to compress the distributed data to be written during the simulation run and to output it to a single compressed file. Several compression algorithms are evaluated on a range of simulation models. The presented method results in a 3x file-size reduction and a decrease in the total application runtime.
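A minimal sketch of the single-file layout such a method might use: each rank's buffer is compressed and appended as a length-prefixed block, here with zlib and a serial loop standing in for the parallel gather. The file name and per-rank data are invented for illustration; the paper does not specify zlib or this layout (link with -lz).

```cpp
#include <zlib.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Stand-ins for the per-rank distributed arrays (4 "ranks").
    std::vector<std::vector<double>> rank_data(4, std::vector<double>(1024, 3.14));

    std::FILE* f = std::fopen("snapshot.z", "wb");
    for (const auto& data : rank_data) {
        uLong src_len = data.size() * sizeof(double);
        uLongf dst_len = compressBound(src_len);   // worst-case compressed size
        std::vector<Bytef> dst(dst_len);
        // Level 6 is zlib's default speed/ratio trade-off.
        compress2(dst.data(), &dst_len,
                  reinterpret_cast<const Bytef*>(data.data()), src_len, 6);
        std::uint64_t block_len = dst_len;         // length prefix for decompression
        std::fwrite(&block_len, sizeof(block_len), 1, f);
        std::fwrite(dst.data(), 1, dst_len, f);
    }
    std::fclose(f);
}
```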
Citations: 0
Proactive Cyber Situation Awareness via High Performance Computing
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916528
A. Wollaber, Jaime Peña, Benjamin Blease, Leslie Shing, Kenneth Alperin, Serge Vilvovsky, P. Trepagnier, Neal Wagner, Leslie Leonard
Cyber situation awareness technologies have largely been focused on present-state conditions, with limited abilities to forward-project nominal conditions in a contested environment. We demonstrate an approach that uses data-driven, high performance computing (HPC) simulations of attacker/defender activities in a logically connected network environment that enables this capability for interactive, operational decision making in real time. Our contributions are three-fold: (1) we link live cyber data to inform the parameters of a cybersecurity model, (2) we perform HPC simulations and optimizations with a genetic algorithm to evaluate and recommend risk remediation strategies that inhibit attacker lateral movement, and (3) we provide a prototype platform to allow cyber defenders to assess the value of their own alternative risk reduction strategies on a relevant timeline. We present an overview of the data and software architectures, and results are presented that demonstrate operational utility alongside HPC-enabled runtimes.
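To make contribution (2) concrete, here is a toy-scale genetic-algorithm sketch: each individual is a bitmask of candidate remediations, and fitness trades residual lateral-movement risk against remediation cost. The risk and cost model is invented purely for illustration and bears no relation to the paper's data-driven model.

```cpp
#include <algorithm>
#include <bitset>
#include <cstdio>
#include <random>
#include <vector>

constexpr int kActions = 16;            // candidate remediations
using Genome = std::bitset<kActions>;

double fitness(const Genome& g) {
    // Hypothetical model: each enabled action removes some risk but costs.
    double risk = 100.0, cost = 0.0;
    for (int i = 0; i < kActions; ++i)
        if (g[i]) { risk *= 0.85; cost += 3.0; }
    return -(risk + cost);              // higher is better
}

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> bit(0, kActions - 1), coin(0, 1);
    std::vector<Genome> pop(32);
    for (auto& g : pop)
        for (int i = 0; i < kActions; ++i) g[i] = coin(rng);

    for (int gen = 0; gen < 50; ++gen) {
        // Sort best-first, keep the top half, refill by crossover + mutation.
        std::sort(pop.begin(), pop.end(), [](const Genome& a, const Genome& b) {
            return fitness(a) > fitness(b);
        });
        for (size_t i = pop.size() / 2; i < pop.size(); ++i) {
            const Genome &a = pop[rng() % (pop.size() / 2)],
                         &b = pop[rng() % (pop.size() / 2)];
            Genome child;
            for (int j = 0; j < kActions; ++j) child[j] = coin(rng) ? a[j] : b[j];
            child.flip(bit(rng));       // point mutation
            pop[i] = child;
        }
    }
    auto best = std::max_element(pop.begin(), pop.end(),
        [](const Genome& a, const Genome& b) { return fitness(a) < fitness(b); });
    std::printf("best fitness = %.2f\n", fitness(*best));
}
```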
Citations: 5
Auxiliary Maximum Likelihood Estimation for Noisy Point Cloud Registration
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916224
Cole Campton, Xiaobai Sun
We first establish a theoretical foundation for the use of the Gromov-Hausdorff (GH) distance for point-set registration with homeomorphic deformation maps perturbed by Gaussian noise. We then present a probabilistic, deformable registration framework. At the core of the framework is a highly efficient iterative algorithm with guaranteed convergence to a local minimum of the GH-based objective function. The framework has two other key components: a multi-scale stochastic shape descriptor and a data compression scheme. We also present an experimental comparison between our method and two existing influential methods on non-rigid motion between digital anthropomorphic phantoms extracted from physical data of multiple individuals.
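For context, the standard correspondence form of the Gromov-Hausdorff distance the abstract builds on (this is the textbook definition, not the paper's noise-perturbed objective):

```latex
% Gromov-Hausdorff distance between metric spaces (X, d_X) and (Y, d_Y),
% where R ranges over correspondences (relations covering both X and Y).
\[
  d_{\mathrm{GH}}(X, Y)
    = \tfrac{1}{2}\,
      \inf_{R}\;
      \sup_{(x, y),\,(x', y') \in R}
      \bigl|\, d_X(x, x') - d_Y(y, y') \,\bigr|
\]
```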
Citations: 0
Concurrent Katz Centrality for Streaming Graphs
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916572
Chunxing Yin, E. J. Riedy
Most current frameworks for streaming graph analysis “stop the world”, halting data ingest while updating analysis results. Others add overhead for different forms of version control. In both approaches, adding analysis kernels adds additional overhead to the entire system. A new formal model of concurrent analysis lets algorithms that are valid for the model update results concurrently with data ingest, without synchronization. Additional kernels incur very little overhead. Here we present the first experimental results for the new model, considering the performance and result latency of updating Katz centrality on a low-power edge platform. The Katz centrality values remain close to those of the synchronous algorithm while reducing result latency by 12.8× to 179×.
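As a static reference for the quantity being maintained: Katz centrality solves x = αAx + 1 and can be computed by fixed-point iteration, which converges when α < 1/λ_max(A). The sketch below shows only this synchronous baseline, not the paper's concurrent streaming update.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Fixed-point iteration for Katz centrality x = alpha * A * x + 1.
std::vector<double> katz(const std::vector<std::vector<int>>& adj,
                         double alpha, double tol = 1e-10) {
    int n = static_cast<int>(adj.size());
    std::vector<double> x(n, 1.0), next(n);
    for (int iter = 0; iter < 1000; ++iter) {
        double diff = 0.0;
        for (int u = 0; u < n; ++u) {
            double s = 0.0;
            for (int v : adj[u]) s += x[v];   // (A x)_u for an undirected graph
            next[u] = alpha * s + 1.0;
            diff = std::max(diff, std::fabs(next[u] - x[u]));
        }
        x.swap(next);
        if (diff < tol) break;                // converged
    }
    return x;
}

int main() {
    // Path graph 0-1-2; alpha = 0.5 is below 1/lambda_max ≈ 0.707.
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1}};
    auto x = katz(adj, 0.5);
    std::printf("katz = %.4f %.4f %.4f\n", x[0], x[1], x[2]);   // 3, 4, 3
}
```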
Citations: 2