
Latest publications: 2019 IEEE High Performance Extreme Computing Conference (HPEC)

COMET: A Distributed Metadata Service for Federated Cloud Infrastructures
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916536
Cong Wang, Komal Thareja, M. Stealey, P. Ruth, I. Baldin
The majority of today's cloud services are independently operated by individual cloud service providers. In this approach, the locations of cloud resources are strictly constrained by the distribution of the providers' sites. As the popularity and scale of cloud services increase, we believe this traditional paradigm is about to shift toward further federated services, a.k.a. multi-cloud, due to improved performance, the reduced cost of compute, storage, and network resources, and increased user demand. In this paper, we present COMET, a lightweight, distributed storage system for managing metadata on large-scale federated cloud infrastructure providers, end users, and their applications. We use two use cases from NSF's ExoGENI and Chameleon research cloud testbeds to show the effectiveness of COMET's design and deployment.
Citations: 1
On Computing with Diagonally Structured Matrices
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916325
S. Hossain, M. S. Mahmud
We present a storage scheme for storing matrices by diagonals, together with algorithms for performing matrix-matrix and matrix-vector multiplication by diagonals. Matrix elements are accessed with stride-1 and involve no indirect referencing. Access to the transposed matrix requires no additional effort. The proposed storage scheme handles dense matrices and matrices with special structure (e.g., banded, triangular, symmetric) in a uniform manner. Test results from preliminary numerical experiments with an OpenMP implementation of our method are encouraging.
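The storage-by-diagonals idea can be sketched in a few lines (a NumPy sketch; `to_diagonals` and `diag_matvec` are illustrative names, not the paper's API):

```python
import numpy as np

def to_diagonals(A):
    """Store a square matrix as {offset d: 1-D array of entries A[i, i+d]}.

    Every diagonal is a contiguous array, so matrix-vector products for
    both A and its transpose walk each diagonal with stride-1 access and
    no indirect indexing.
    """
    n = A.shape[0]
    return {d: np.diagonal(A, offset=d).copy() for d in range(-(n - 1), n)}

def diag_matvec(diags, x, transpose=False):
    """y = A @ x (or A.T @ x) computed diagonal by diagonal."""
    n = x.shape[0]
    y = np.zeros(n)
    for d, vals in diags.items():
        if transpose:
            d = -d                      # diagonal d of A is diagonal -d of A.T
        if d >= 0:                      # entries A[i, i+d], i = 0..n-d-1
            y[:n - d] += vals * x[d:]
        else:                           # entries A[i, i+d], i = -d..n-1
            y[-d:] += vals * x[:n + d]
    return y
```

A banded or triangular matrix simply stores fewer offsets in the same dictionary, which is how the scheme treats special structure uniformly; the transposed product reuses the same arrays with the offsets negated.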
Citations: 2
Message Scheduling for Performant, Many-Core Belief Propagation
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916366
Mark Van der Merwe, Vinu Joseph, Ganesh Gopalakrishnan
Belief Propagation (BP) is a message-passing algorithm for approximate inference over Probabilistic Graphical Models (PGMs), finding many applications such as computer vision, error-correcting codes, and protein-folding. While general, the algorithm's convergence and speed have limited its practical use on difficult inference problems. As an algorithm that is highly amenable to parallelization, many-core Graphical Processing Units (GPUs) could significantly improve BP performance. Improving BP through many-core systems is non-trivial: the scheduling of messages in the algorithm strongly affects performance. We present a study of message scheduling for BP on GPUs. We demonstrate that BP exhibits a tradeoff between speed and convergence based on parallelism, and show that existing message schedules are unable to exploit this tradeoff. To this end, we present a novel randomized message scheduling approach, Randomized BP (RnBP), which outperforms existing methods on the GPU.
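The randomized-scheduling idea can be sketched on a toy model. This is only an illustration of the scheduling, not the paper's GPU kernel: a 3-node binary chain (where BP is exact, so the result is checkable by enumeration), with each sweep recomputing a random subset of messages; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise MRF: a 3-node chain with binary variables (BP is exact on trees).
edges = [(0, 1), (1, 2)]
unary = [np.array([0.7, 0.3]), np.array([0.4, 0.6]), np.array([0.5, 0.5])]
pair = {e: np.array([[1.2, 0.8], [0.8, 1.2]]) for e in edges}

directed = edges + [(j, i) for i, j in edges]
msgs = {e: np.ones(2) for e in directed}

def update(e):
    """Sum-product update for the message i -> j."""
    i, j = e
    prod = unary[i].copy()
    for (k, t) in directed:
        if t == i and k != j:          # messages into i, except from j
            prod = prod * msgs[(k, i)]
    psi = pair[(i, j)] if (i, j) in pair else pair[(j, i)].T
    m = psi.T @ prod                   # m[x_j] = sum_{x_i} psi[x_i, x_j] * prod[x_i]
    return m / m.sum()

# Randomized scheduling (RnBP-style): each sweep recomputes only a random
# subset of messages; on a GPU the chosen subset would update in parallel.
for _ in range(200):
    subset = [e for e in directed if rng.random() < 0.5]
    fresh = {e: update(e) for e in subset}
    msgs.update(fresh)

def belief(i):
    b = unary[i].copy()
    for (k, t) in directed:
        if t == i:
            b = b * msgs[(k, i)]
    return b / b.sum()
```

The update fraction (here 0.5) is the knob that trades per-sweep parallelism against convergence behavior, which is the tradeoff the paper studies.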
Citations: 3
Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916392
P. Luszczek, I. Yamazaki, J. Dongarra
The emergence of deep learning as a leading computational workload for machine learning tasks on large-scale cloud infrastructure installations has led to a plethora of accelerator hardware releases. However, the reduced precision and range of floating-point numbers on these new platforms make it a non-trivial task to leverage these unprecedented advances in computational power for numerical linear algebra operations that come with a guarantee of robust error bounds. To address these concerns, we present a number of strategies that can be used to increase the accuracy of limited-precision iterative refinement. By limited precision, we mean 16-bit floating-point formats implemented in modern hardware accelerators that are not necessarily compliant with the IEEE half-precision specification. We explain the broader context and connections to established IEEE floating-point standards and existing high-performance computing (HPC) benchmarks. We also present a new formulation of LU factorization that we call signed square root LU, which produces more numerically balanced L and U factors and directly addresses the limited range of low-precision storage formats. The experimental results indicate that it is possible to recover substantial amounts of accuracy in the system solution that would otherwise be lost. Previously, this could only be achieved by using iterative refinement based on single-precision floating-point arithmetic. The discussion also explores the numerical stability issues that are important for robust linear solvers on these new hardware platforms.
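The core recovery mechanism, iterative refinement around a low-precision factorization, can be sketched as follows. This is a minimal illustration only: the fp16 factorization is simulated by rounding A to float16, and the paper's signed square root LU is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

# Simulate a half-precision factorization by rounding A to float16; solves
# against A_lo stand in for triangular solves with the fp16 LU factors.
A_lo = A.astype(np.float16).astype(np.float64)

x = np.linalg.solve(A_lo, b)                      # initial low-precision solve
for _ in range(20):
    r = b - A @ x                                 # residual in full precision
    x = x + np.linalg.solve(A_lo, r)              # cheap low-precision correction
```

Each pass reuses the (inaccurate) low-precision factorization but measures the residual in full precision, so the solution converges toward full working accuracy as long as the refinement iteration contracts.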
Citations: 3
Large Scale Parallelization Using File-Based Communications
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916221
C. Byun, J. Kepner, W. Arcand, David Bestor, Bill Bergeron, V. Gadepally, Michael Houle, M. Hubbell, Michael Jones, Anna Klein, P. Michaleas, J. Mullen, Andrew Prout, Antonio Rosa, S. Samsi, Charles Yee, A. Reuther
In this paper, we present a novel file-based communication architecture that uses the local filesystem for large-scale parallelization. This new approach eliminates filesystem overload and resource contention that arise when the central filesystem is used for large parallel jobs. It incurs additional overhead due to inter-node message file transfers when the sending and receiving processes are not on the same node. However, even with this additional cost, its benefits for overall cluster operation are far greater, in addition to the performance enhancement in message communication for large-scale parallel jobs. For example, when running a 2048-process parallel job, MPI_Bcast() achieved about 34 times better performance when using the local filesystem.
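The local-filesystem message exchange can be sketched in a few lines (illustrative helper names, not the paper's implementation; the inter-node scp transfer is represented only by a comment):

```python
import os
import pickle
import tempfile
import time

def send(msg_dir, src, dst, tag, obj):
    """Publish a message as a file in the local message directory.

    For receivers on another node, the finished file would be copied to
    that node's local message directory with scp.
    """
    tmp = os.path.join(msg_dir, f".{src}_{dst}_{tag}.tmp")
    final = os.path.join(msg_dir, f"{src}_{dst}_{tag}.msg")
    with open(tmp, "wb") as f:
        pickle.dump(obj, f)
    os.rename(tmp, final)       # atomic: the receiver never sees a partial file

def recv(msg_dir, src, dst, tag, timeout=5.0):
    """Poll the local filesystem until the expected message file appears."""
    path = os.path.join(msg_dir, f"{src}_{dst}_{tag}.msg")
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(path)
        time.sleep(0.01)
    with open(path, "rb") as f:
        return pickle.load(f)
```

Writing to a temporary name and renaming gives the one property a file-based channel needs: a message file is either absent or complete, so the receiver can poll without locking.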
Citations: 6
Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916468
Brian Plancher, C. Brumar, I. Brumar, Lillian Pentecost, Saketh Rama, D. Brooks
Computational efficiency is a critical constraint for a variety of cutting-edge real-time applications. In this work, we identify an opportunity to speed up the end-to-end runtime of two such compute-bound applications by incorporating approximate linear algebra techniques. In particular, we apply approximate matrix multiplication to artificial Neural Networks (NNs) for image classification and to the robotics problem of Distributed Simultaneous Localization and Mapping (DSLAM). Expanding upon recent sampling-based Monte Carlo approximation strategies for matrix multiplication, we develop updated theoretical bounds and an adaptive error prediction strategy. We then apply these techniques to NNs and DSLAM, increasing the speed of both applications by 15-20% while maintaining 97% classification accuracy for NNs running on the MNIST dataset and keeping the average robot position error under 1 meter (vs. 0.32 meters for the exact solution). However, both applications experience variance in their results.
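The sampling-based approximation the authors build on can be sketched as an unbiased outer-product sampler in the style of Drineas, Kannan, and Mahoney (function name illustrative):

```python
import numpy as np

def approx_matmul(A, B, c, rng):
    """Approximate A @ B by sampling c of the n outer products A[:, k] B[k, :].

    Column k is drawn with probability p_k proportional to
    ||A[:, k]|| * ||B[k, :]||; rescaling each sampled term by 1 / (c * p_k)
    keeps the estimator unbiased, and the expected error shrinks roughly
    as 1 / sqrt(c).
    """
    n = A.shape[1]
    w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = w / w.sum()
    idx = rng.choice(n, size=c, p=p)           # sample with replacement
    return (A[:, idx] / (c * p[idx])) @ B[idx, :]
```

The sample count c is the speed/accuracy knob: fewer samples mean fewer flops and more variance, which matches the variance the abstract reports in the end-to-end results.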
Citations: 4
Multithreaded Layer-wise Training of Sparse Deep Neural Networks using Compressed Sparse Column
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916494
M. Hasanzadeh-Mofrad, R. Melhem, Muhammad Yousuf Ahmad, Mohammad Hammoud
Training a sparse Deep Neural Network (DNN) is inherently less memory-intensive and processor-intensive than training a dense (fully-connected) DNN. In this paper, we utilize Sparse Matrix-Matrix Multiplication (SpMM) to train sparsely-connected DNNs, as opposed to the dense matrix-matrix multiplication used for training dense DNNs. In our C/C++ implementation, we extensively use in-memory Compressed Sparse Column (CSC) data structures to store and traverse the neural network layers. We train the neural network layer by layer, and within each layer we use 1D-column partitioning to divide the computation required for training among threads. To speed up the computation, we apply the bias and activation functions while executing SpMM operations. We tested our implementation using benchmarks provided by the MIT/IEEE/Amazon HPEC Graph Challenge [1]. Based on our results, our single-threaded (1 core) and multithreaded (12 cores) implementations are up to 22 times and 150 times faster, respectively, than the serial MATLAB results provided by the challenge.
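A serial sketch of the CSC traversal with the bias and activation fused into the SpMM (the outer column loop is the unit of 1D-column partitioning that the paper divides among threads; helper names are illustrative, and NumPy stands in for the C/C++ arrays):

```python
import numpy as np

def to_csc(dense):
    """Build (data, indices, indptr) CSC arrays for a weight matrix."""
    indptr, indices, data = [0], [], []
    for j in range(dense.shape[1]):
        rows = np.nonzero(dense[:, j])[0]
        indices.extend(rows.tolist())
        data.extend(dense[rows, j].tolist())
        indptr.append(len(indices))
    return np.array(data), np.array(indices, dtype=int), np.array(indptr)

def spmm_bias_relu(X, csc, n_cols, bias):
    """Y = relu(X @ W + bias) with the weight matrix W stored in CSC.

    The j-loop is the 1D-column partitioning unit: each thread would own
    a contiguous slice of output columns. Bias and ReLU are applied while
    the output column is still hot, instead of in a separate pass.
    """
    data, indices, indptr = csc
    Y = np.zeros((X.shape[0], n_cols))
    for j in range(n_cols):
        col = np.zeros(X.shape[0])
        for p in range(indptr[j], indptr[j + 1]):
            col += data[p] * X[:, indices[p]]
        Y[:, j] = np.maximum(col + bias[j], 0.0)   # fused bias + activation
    return Y
```

Because each output column depends only on its own slice of `data`/`indices`, the column loop partitions among threads with no write conflicts, which is what makes the 1D-column scheme attractive.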
Citations: 10
Scalable Solvers for Cone Complementarity Problems in Frictional Multibody Dynamics
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916234
Saibal De, Eduardo Corona, P. Jayakumar, S. Veerapaneni
We present an efficient, hybrid MPI/OpenMP framework for the cone complementarity formulation of large-scale rigid body dynamics problems with frictional contact. Data is partitioned among MPI processes using a Morton encoding in order to promote data locality and minimize communication. We parallelize state-of-the-art first- and second-order solvers for the resulting cone complementarity optimization problems. Our approach is highly scalable, enabling the solution of dense, large-scale multibody problems; a sedimentation simulation involving 256 million particles (~324 million contacts on average) was resolved using 512 cores in less than half an hour per time-step.
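Morton (Z-order) encoding interleaves the bits of the grid coordinates, so sorting bodies by code and cutting the sorted order into equal chunks yields spatially compact partitions. A 2-D sketch (the paper's setting is 3-D; names are illustrative):

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of grid coordinates (x, y) into one Morton code."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)        # x occupies even bit positions
        code |= ((y >> b) & 1) << (2 * b + 1)    # y occupies odd bit positions
    return code

def partition(coords, n_ranks):
    """Sort bodies along the Z-order curve, then cut into equal chunks."""
    order = sorted(range(len(coords)), key=lambda i: morton2d(*coords[i]))
    size = -(-len(order) // n_ranks)             # ceiling division
    return [order[k:k + size] for k in range(0, len(order), size)]
```

Nearby cells receive nearby codes, so each chunk of the sorted order is spatially clustered, which is what promotes locality and keeps inter-rank contact communication low.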
Citations: 5
FFTX for Micromechanical Stress-Strain Analysis
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916267
Anuva Kulkarni, Daniele G. Spampinato, F. Franchetti
Porting scientific simulations to heterogeneous platforms requires complex algorithmic and optimization strategies to overcome memory and communication bottlenecks. Such operations are inexpressible using traditional libraries (e.g., FFTW for spectral methods) and difficult to optimize by hand for various hardware platforms. In this work, we use our GPU-adapted stress-strain analysis method to show how FFTX, a new API that extends FFTW, can be used to express our algorithm without worrying about code optimization, which is handled by a backend code generator.
Citations: 0
Emerging Applications of 3D Integration and Approximate Computing in High-Performance Computing Systems: Unique Security Vulnerabilities
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916503
Pruthvy Yellu, Zhiming Zhang, M. Monjur, Ranuli Abeysinghe, Qiaoyan Yu
High-performance computing (HPC) systems rely on new technologies such as emerging devices, advanced integration techniques, and computing architectures to continue advancing performance. The adoption of new techniques could potentially leave high-performance computing systems vulnerable to new security threats. This work analyzes the security challenges in HPC systems that employ three-dimensional integrated circuits and approximate computing. Case studies are provided to show the impact of new security threats on system integrity and to highlight the urgent need for new security measures.
Citations: 3