
SC20: International Conference for High Performance Computing, Networking, Storage and Analysis - Latest Publications

Scalable yet Rigorous Floating-Point Error Analysis
Arnab Das, Ian Briggs, G. Gopalakrishnan, S. Krishnamoorthy, P. Panchekha
Automated techniques for rigorous floating-point round-off error analysis are a prerequisite to placing important activities in HPC, such as precision allocation, verification, and code optimization, on a formal footing. Yet existing techniques cannot provide tight bounds for expressions beyond a few dozen operators, barely enough for HPC. In this work, we offer an approach embedded in a new tool called SATIRE that scales error analysis by four orders of magnitude compared to today's best-of-class tools. We explain how three key ideas underlying SATIRE help it attain such scale: path strength reduction, bound optimization, and abstraction. SATIRE provides tight bounds and rigorous guarantees on significantly larger expressions with well over a hundred thousand operators, covering important examples including FFT, matrix multiplication, and PDE stencils.
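The abstract does not spell out how such bounds are computed, but the first-order rounding model this line of work builds on is easy to illustrate. Below is a minimal Python sketch, our illustration rather than SATIRE's implementation, that propagates worst-case round-off bounds through an expression using the standard (1 + eps) model, in which every operation may introduce a relative error of at most the unit round-off u.

```python
# Minimal sketch of first-order round-off error propagation using the
# (1 + eps) model: every floating-point op returns fl(x op y) = (x op y)(1 + d),
# with |d| <= u.  This illustrates the general idea, not SATIRE itself.
U = 2.0 ** -53  # unit round-off for IEEE double precision

class Node:
    """Expression node carrying an exact value and a worst-case error bound."""
    def __init__(self, value, err=0.0):
        self.value = value  # real (infinitely precise) value
        self.err = err      # bound on |computed - value|

def add(a, b):
    exact = a.value + b.value
    # incoming errors add; the op itself contributes u * |result|
    return Node(exact, a.err + b.err + U * abs(exact))

def mul(a, b):
    exact = a.value * b.value
    # first-order propagation: |b|*err_a + |a|*err_b, plus the op's own rounding
    prop = abs(b.value) * a.err + abs(a.value) * b.err
    return Node(exact, prop + U * abs(exact))

# Worst-case bound for a left-to-right sum of 1000 terms of the harmonic series.
acc = Node(0.0)
for i in range(1, 1001):
    acc = add(acc, Node(1.0 / i))
print(f"value ~ {acc.value:.6f}, absolute error bound ~ {acc.err:.3e}")
```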
{"title":"Scalable yet Rigorous Floating-Point Error Analysis","authors":"Arnab Das, Ian Briggs, G. Gopalakrishnan, S. Krishnamoorthy, P. Panchekha","doi":"10.1109/SC41405.2020.00055","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00055","url":null,"abstract":"Automated techniques for rigorous floating-point round-off error analysis are a prerequisite to placing important activities in HPC such as precision allocation, verification, and code optimization on a formal footing. Yet existing techniques cannot provide tight bounds for expressions beyond a few dozen operators–barely enough for HPC. In this work, we offer an approach embedded in a new tool called SATIHE that scales error analysis by four orders of magnitude compared to today’s best-of-class tools. We explain how three key ideas underlying SATIHE helps it attain such scale: path strength reduction, bound optimization, and abstraction. SATIHE provides tight bounds and rigorous guarantees on significantly larger expressions with well over a hundred thousand operators, covering important examples including FFT, matrix multiplication, and PDE stencils.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114714853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations
Minhui Xie, Kai Ren, Youyou Lu, Guangxu Yang, Qingxing Xu, Bihai Wu, Jiazhen Lin, H. Ao, Wanhong Xu, J. Shu
Modern recommendation systems in industry often use deep learning (DL) models that achieve better model accuracy with more data and model parameters. However, current open-source DL frameworks, such as TensorFlow and PyTorch, show relatively low scalability when training recommendation models with terabytes of parameters. To efficiently learn large-scale recommendation models from data streams that generate hundreds of terabytes of training data daily, we introduce a continual learning system called Kraken. Kraken contains a special parameter server implementation that dynamically adapts to the rapidly changing set of sparse features for the continual training and serving of recommendation models. Kraken provides a sparsity-aware training system that uses different learning optimizers for dense and sparse parameters to reduce memory overhead. Extensive experiments using real-world datasets confirm the effectiveness and scalability of Kraken. Kraken can improve the accuracy of recommendation tasks with the same memory resources, or cut memory usage to roughly a third while maintaining model performance.
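The sparsity-aware optimizer split, a memory-light update rule for the huge sparse embedding table and a stateful one only for the small dense parameters, can be sketched in a few lines. The class below is a toy stand-in under our own assumptions (plain SGD vs. momentum SGD), not Kraken's actual parameter server API.

```python
import numpy as np

# Toy parameter server: sparse embedding rows get plain SGD (no per-row
# optimizer state), dense weights get momentum (state is affordable there).
class ToyParamServer:
    def __init__(self, dense_shape, emb_dim, lr=0.01, momentum=0.9):
        self.dense = np.zeros(dense_shape)
        self.dense_vel = np.zeros(dense_shape)   # optimizer state: dense only
        self.embeddings = {}                     # feature id -> row, grown lazily
        self.emb_dim, self.lr, self.momentum = emb_dim, lr, momentum

    def lookup(self, feature_id):
        # dynamically admit new sparse features as the stream evolves
        if feature_id not in self.embeddings:
            self.embeddings[feature_id] = np.random.randn(self.emb_dim) * 0.01
        return self.embeddings[feature_id]

    def apply_sparse_grad(self, feature_id, grad):
        # stateless SGD: no extra optimizer memory per embedding row
        self.embeddings[feature_id] -= self.lr * grad

    def apply_dense_grad(self, grad):
        # momentum SGD: optimizer state kept only for the small dense part
        self.dense_vel = self.momentum * self.dense_vel - self.lr * grad
        self.dense += self.dense_vel

ps = ToyParamServer(dense_shape=(4, 4), emb_dim=8)
row = ps.lookup("user:42")
ps.apply_sparse_grad("user:42", np.ones(8) * 0.1)
ps.apply_dense_grad(np.ones((4, 4)) * 0.05)
```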
{"title":"Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations","authors":"Minhui Xie, Kai Ren, Youyou Lu, Guangxu Yang, Qingxing Xu, Bihai Wu, Jiazhen Lin, H. Ao, Wanhong Xu, J. Shu","doi":"10.1109/SC41405.2020.00025","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00025","url":null,"abstract":"Modern recommendation systems in industry often use deep learning (DL) models that achieve better model accuracy with more data and model parameters. However, current opensource DL frameworks, such as TensorFlow and PyTorch, show relatively low scalability on training recommendation models with terabytes of parameters. To efficiently learn large-scale recommendation models from data streams that generate hundreds of terabytes training data daily, we introduce a continual learning system called Kraken. Kraken contains a special parameter server implementation that dynamically adapts to the rapidly changing set of sparse features for the continual training and serving of recommendation models. Kraken provides a sparsity-aware training system that uses different learning optimizers for dense and sparse parameters to reduce memory overhead. Extensive experiments using real-world datasets confirm the effectiveness and scalability of Kraken. Kraken can benefit the accuracy of recommendation tasks with the same memory resources, or trisect the memory usage while keeping model performance.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129846628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
SegAlign: A Scalable GPU-Based Whole Genome Aligner
Sneha D. Goenka, Yatish Turakhia, B. Paten, M. Horowitz
Pairwise Whole Genome Alignment (WGA) is a crucial first step to understanding evolution at the DNA sequence level. Pairwise WGA of the thousands of currently available species genomes could help make biological discoveries; however, computing it for even a fraction of the millions of possible pairs is prohibitive: WGA of a single pair of vertebrate genomes (human-mouse) takes 11 hours on a 96-core Amazon Web Services (AWS) instance (c5.24xlarge). This paper presents SegAlign, a scalable, GPU-accelerated system for computing pairwise WGA. SegAlign is based on the standard seed-filter-extend heuristic, in which the filtering stage dominates the runtime (e.g., 98% for human-mouse WGA) and is accelerated using GPUs. Using three vertebrate genome pairs, we show that SegAlign provides a speedup of up to $14\times$ for WGA on an 8-GPU, 64-core AWS instance (p3.16xlarge) and nearly a $2.3\times$ reduction in dollar cost. SegAlign also allows parallelization over multiple GPU nodes and scales efficiently.
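The seed-filter-extend heuristic that SegAlign accelerates can be illustrated on plain strings. The following CPU-only Python sketch, with hypothetical parameter choices and an ungapped X-drop extension, shows the three stages; SegAlign's GPU pipeline is of course far more elaborate.

```python
# Seed-filter-extend on plain strings: (1) seed: index k-mers of the target;
# (2) filter: keep query k-mers with few target hits; (3) extend: grow each
# surviving seed ungapped until the score drops too far below its maximum.
from collections import defaultdict

def seed_filter_extend(query, target, k=4, max_hits=4, x_drop=3):
    index = defaultdict(list)                       # seed stage
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)

    anchors = []
    for i in range(len(query) - k + 1):             # filter stage
        hits = index.get(query[i:i + k], [])
        if 0 < len(hits) <= max_hits:               # drop repetitive seeds
            anchors += [(i, j) for j in hits]

    alignments = []
    for qi, ti in anchors:                          # extend stage (rightward)
        score = best = end = 0
        for off in range(min(len(query) - qi, len(target) - ti)):
            score += 1 if query[qi + off] == target[ti + off] else -2
            if score > best:
                best, end = score, off + 1
            if best - score > x_drop:               # X-drop termination
                break
        alignments.append((qi, ti, end, best))
    return alignments

print(seed_filter_extend("ACGTACGTGG", "TTACGTACGTAA"))
```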
{"title":"SegAlign: A Scalable GPU-Based Whole Genome Aligner","authors":"Sneha D. Goenka, Yatish Turakhia, B. Paten, M. Horowitz","doi":"10.1109/SC41405.2020.00043","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00043","url":null,"abstract":"Pairwise Whole Genome Alignment (WGA) is a crucial first step to understanding evolution at the DNA sequence-level. Pairwise WGA of thousands of currently available species genomes could help make biological discoveries, however, computing them for even a fraction of the millions of possible pairs is prohibitive – WGA of a single pair of vertebrate genomes (human-mouse) takes 11 hours on a 96-core Amazon Web Services (AWS) instance (c5.24xlarge). This paper presents SegAlign – a scalable, GPU-accelerated system for computing pairwise WGA. SegAlign is based on the standard seed-filter-extend heuristic, in which the filtering stage dominates the runtime (e.g. 98% for human-mouse WGA), and is accelerated using GPU(s). Using three vertebrate genome pairs, we show that SegAlign provides a speedup of up to $14 times $ on an 8-GPU, 64-core AWS instance (p3.16xlarge) for WGA and nearly $2.3 times $ reduction in dollar cost. SegAlign also allows parallelization over multiple GPU nodes and scales efficiently.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132274172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
A Hierarchical and Load-Aware Design for Large Message Neighborhood Collectives
S. M. Ghazimirsaeed, Qinghua Zhou, Amit Ruhela, Mohammadreza Bayatpour
The MPI-3.0 standard introduced neighborhood collectives to support the sparse communication patterns used in many applications. In this paper, we propose a hierarchical and distributed graph topology that considers the physical topology of the system and the virtual communication pattern of processes to improve the performance of large-message neighborhood collectives. Moreover, we propose two design alternatives on top of the hierarchical design: (1) LAG-H, which assumes the same communication load for all processes, and (2) LAW-H, which considers the communication load of each process so that load is distributed fairly among them. We propose a mathematical model to determine the communication capacity of each process, and then use the derived capacity to fairly distribute the load between processes. Our experimental results on up to 28,672 processes show speedups of up to 9x for various process topologies. We also observe up to 8.2% performance gain and a 34x speedup for NAS-DT and SpMM, respectively.
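The capacity-based fair distribution can be mimicked independently of MPI: split a large message's transfer among helper processes in proportion to their spare communication capacity. The function below is a toy Python sketch under our own simplifying assumptions, not the paper's mathematical model.

```python
# Toy load-aware partitioning: split `total_bytes` across helpers in
# proportion to their spare capacity (capacity minus existing load),
# mimicking the fairness goal of a LAW-H-style design.
def partition_load(total_bytes, capacities, own_loads):
    spare = [max(c - l, 0) for c, l in zip(capacities, own_loads)]
    total_spare = sum(spare)
    if total_spare == 0:                  # nobody has headroom: split evenly
        return [total_bytes // len(capacities)] * len(capacities)
    shares = [total_bytes * s // total_spare for s in spare]
    shares[spare.index(max(spare))] += total_bytes - sum(shares)  # fix rounding
    return shares

# Process 2 is already busy, so it is assigned the smallest chunk.
print(partition_load(1 << 20, capacities=[100, 100, 100], own_loads=[10, 30, 90]))
```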
{"title":"A Hierarchical and Load-Aware Design for Large Message Neighborhood Collectives","authors":"S. M. Ghazimirsaeed, Qinghua Zhou, Amit Ruhela, Mohammadreza Bayatpour","doi":"10.1109/SC41405.2020.00038","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00038","url":null,"abstract":"The MPI-3.0 standard introduced neighborhood collective to support sparse communication patterns used in many applications. In this paper, we propose a hierarchical and distributed graph topology that considers the physical topology of the system and the virtual communication pattern of processes to improve the performance of large message neighborhood collectives. Moreover, we propose two design alternatives on top of the hierarchical design: 1. LAG-H: assumes the same communication load for all processes, 2. LAW-H: considers the communication load of processes for fair distribution of load between them. We propose a mathematical model to determine the communication capacity of each process. Then, we use the derived capacity to fairly distribute the load between processes. Our experimental results on up to 28,672 processes show up to 9x speedup for various process topologies. We also observe up to 8.2% performance gain and 34x speedup for NAS-DT and SpMM, respectively.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134507856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Distributed-Memory Parallel Symmetric Nonnegative Matrix Factorization
Srinivas Eswar, Koby Hayashi, Grey Ballard, R. Kannan, R. Vuduc, Haesun Park
We develop the first distributed-memory parallel implementation of Symmetric Nonnegative Matrix Factorization (SymNMF), a key data analytics kernel for clustering and dimensionality reduction. Our implementation includes two different algorithms for SymNMF, which give comparable results in terms of time and accuracy. The first algorithm parallelizes an existing sequential approach that uses solvers for nonsymmetric NMF. The second algorithm is a novel approach based on the Gauss-Newton method; it exploits second-order information without incurring large computational and memory costs. We evaluate the scalability of our algorithms on the Summit system at Oak Ridge National Laboratory, scaling up to 128 nodes (4,096 cores) with 70% efficiency. Additionally, we demonstrate our software on an image segmentation task.
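SymNMF seeks a nonnegative H minimizing $\|A - HH^T\|_F^2$. As a sketch of the kernel being parallelized, here is a simple projected gradient descent in Python; note that the paper's second algorithm uses Gauss-Newton, which this deliberately simplified sketch does not implement.

```python
import numpy as np

# Projected gradient descent for SymNMF: min_{H >= 0} ||A - H H^T||_F^2,
# with gradient 4 (H H^T - A) H.  A sequential toy, not the paper's
# distributed Gauss-Newton solver.
def symnmf_pgd(A, rank, steps=1000, lr=2e-3, seed=0):
    rng = np.random.default_rng(seed)
    H = np.abs(rng.standard_normal((A.shape[0], rank))) * 0.1
    for _ in range(steps):
        grad = 4.0 * (H @ H.T - A) @ H
        H = np.maximum(H - lr * grad, 0.0)  # project onto the nonnegative orthant
    return H

# Symmetric nonnegative test matrix with planted low-rank structure.
rng = np.random.default_rng(1)
W = np.abs(rng.standard_normal((20, 3)))
A = W @ W.T
H = symnmf_pgd(A, rank=3)
print("relative residual:", np.linalg.norm(A - H @ H.T) / np.linalg.norm(A))
```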
{"title":"Distributed-Memory Parallel Symmetric Nonnegative Matrix Factorization","authors":"Srinivas Eswar, Koby Hayashi, Grey Ballard, R. Kannan, R. Vuduc, Haesun Park","doi":"10.1109/SC41405.2020.00078","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00078","url":null,"abstract":"We develop the first distributed-memory parallel implementation of Symmetric Nonnegative Matrix Factorization (SymNMF), a key data analytics kernel for clustering and dimensionality reduction. Our implementation includes two different algorithms for SymNMF, which give comparable results in terms of time and accuracy. The first algorithm is a parallelization of an existing sequential approach that uses solvers for non symmetric NMF. The second algorithm is a novel approach based on the Gauss-Newton method. It exploits second-order information without incurring large computational and memory costs. We evaluate the scalability of our algorithms on the Summit system at Oak Ridge National Laboratory, scaling up to 128 nodes (4,096 cores) with 70% efficiency. Additionally, we demonstrate our software on an image segmentation task.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122927359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
pLiner: Isolating Lines of Floating-Point Code for Compiler-Induced Variability
Hui Guo, I. Laguna, Cindy Rubio-González
Scientific applications are often impacted by numerical inconsistencies when using different compilers, or when a compiler is used with different optimization levels; such inconsistencies hinder reproducibility and can be hard to diagnose. We present pLiner, a tool that automatically pinpoints the code lines that trigger compiler-induced variability. pLiner uses a novel approach to enhance floating-point precision at different levels of code granularity, and performs a guided search to identify locations affected by numerical inconsistencies. We demonstrate pLiner on a real-world numerical inconsistency that required weeks to diagnose, which pLiner isolates in minutes. We also evaluate pLiner on 100 synthetic programs and the NAS Parallel Benchmarks (NPB). On the synthetic programs, pLiner detects the affected lines of code 87% of the time, while the state-of-the-art approach detects them only 6% of the time. Furthermore, pLiner successfully isolates all numerical inconsistencies found in the NPB.
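The guided search can be thought of as a delta-debugging-style bisection over program regions: raise the precision of half the candidate lines, re-run both compilations, and recurse into whichever half still explains the inconsistency. The Python skeleton below abstracts the precision-enhancement-and-compare step into a callback; this framing is our simplification, not necessarily pLiner's exact mechanism.

```python
# Bisection search for a minimal set of lines whose precision enhancement
# removes a compiler-induced inconsistency.  `fixed_by(lines)` answers: "if
# exactly these lines are computed in higher precision, do -O3 and -O0 agree?"
def isolate(lines, fixed_by):
    if len(lines) == 1:
        return lines
    mid = len(lines) // 2
    left, right = lines[:mid], lines[mid:]
    if fixed_by(left):                  # culprit(s) entirely in the left half
        return isolate(left, fixed_by)
    if fixed_by(right):                 # culprit(s) entirely in the right half
        return isolate(right, fixed_by)
    # culprits span both halves: isolate each side while keeping the other fixed
    l = isolate(left, lambda s: fixed_by(s + right))
    r = isolate(right, lambda s: fixed_by(l + s))
    return l + r

# Toy oracle: pretend lines 17 and 42 together cause the variability.
culprits = {17, 42}
lines = list(range(1, 65))
found = isolate(lines, lambda s: culprits <= set(s))
print("isolated lines:", found)   # -> [17, 42]
```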
{"title":"PLINER: Isolating Lines of Floating-Point Code for Compiler-Induced Variability","authors":"Hui Guo, I. Laguna, Cindy Rubio-González","doi":"10.1109/SC41405.2020.00053","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00053","url":null,"abstract":"Scientific applications are often impacted by numerical inconsistencies when using different compilers or when a compiler is used with different optimization levels; such inconsistencies hinder reproducibility and can be hard to diagnose. We present PLINER, a tool to automatically pinpoint code lines that trigger compiler-induced variability. PLINER uses a novel approach to enhance floating-point precision at different levels of code granularity, and performs a guided search to identify locations affected by numerical inconsistencies. We demonstrate PLINER on a real-world numerical inconsistency that required weeks to diagnose, which PLINER isolates in minutes. We also evaluate PLiNER on 100 synthetic programs, and the NAS Parallel Benchmarks (NPB). On the synthetic programs, PLiNER detects the affected lines of code 87% of the time while the stateof-the-art approach only detects the affected lines 6% of the time. Furthermore, PLINER successfully isolates all numerical inconsistencies found in the NPB.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128671661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning
Di Zhang, Dong Dai, Youbiao He, F. S. Bao, Bing Xie
Today's high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtaining high system efficiency. Existing HPC batch job schedulers typically leverage heuristic priority functions to prioritize and schedule jobs. But once configured and deployed by experts, such priority functions can hardly adapt to changes in job loads, optimization goals, or system settings, potentially leading to degraded system efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an automated HPC batch job scheduler built on reinforcement learning. RLScheduler relies on minimal manual intervention or expert knowledge, yet can learn high-quality scheduling policies via its own continuous 'trial and error'. We introduce a new kernel-based neural network structure and a trajectory filtering mechanism in RLScheduler to improve and stabilize the learning process. Through extensive evaluations, we confirm that RLScheduler can learn high-quality scheduling policies for various workloads and optimization goals with relatively low computation cost. Moreover, we show that the learned models perform stably even when applied to unseen workloads, making them practical for production use.
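At its core, the approach replaces a hand-written priority function with a learned one: a small network scores each pending job's features, a softmax over the scores picks the next job, and policy gradient updates the weights to reduce average slowdown. The numpy sketch below is a heavily simplified single-node toy; the features, reward, and environment are our stand-ins, not RLScheduler's.

```python
import numpy as np

# A learned priority function: score = w . features(job); pick a job via
# softmax over scores; REINFORCE nudges w to reduce average slowdown.
rng = np.random.default_rng(0)
w = np.zeros(3)   # weights over [wait_time, run_time, n_procs] (toy features)

def schedule_episode(jobs, w, lr=0.05):
    t, grads, rewards = 0.0, [], []
    pending = list(jobs)
    while pending:
        feats = np.array([[t - j["submit"], j["run"], j["procs"]] for j in pending])
        feats = feats / (np.abs(feats).max(axis=0) + 1e-9)   # crude normalization
        probs = np.exp(feats @ w); probs /= probs.sum()
        k = rng.choice(len(pending), p=probs)
        # REINFORCE gradient of log pi(k): feats[k] - E[feats]
        grads.append(feats[k] - probs @ feats)
        job = pending.pop(k)
        t += job["run"]                                      # run to completion
        rewards.append(-(t - job["submit"]) / max(job["run"], 1.0))  # -slowdown
    baseline = np.mean(rewards)
    for g, r in zip(grads, rewards):
        w += lr * (r - baseline) * g
    return w, -np.mean(rewards)

jobs = [{"submit": 0.0, "run": float(r), "procs": p}
        for r, p in [(10, 4), (1, 1), (5, 2), (2, 8)]]
for ep in range(200):
    w, avg_slowdown = schedule_episode(jobs, w)
print("learned weights:", w, "avg slowdown:", round(avg_slowdown, 2))
```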
{"title":"RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning","authors":"Di Zhang, Dong Dai, Youbiao He, F. S. Bao, Bing Xie","doi":"10.1109/SC41405.2020.00035","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00035","url":null,"abstract":"Today’s high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtain high system efficiency. Existing HPC batch job schedulers typically leverage heuristic priority functions to prioritize and schedule jobs. But, once configured and deployed by the experts, such priority functions can hardly adapt to the changes of job loads, optimization goals, or system settings, potentially leading to degraded system efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an automated HPC batch job scheduler built on reinforcement learning. RLScheduler relies on minimal manual interventions or expert knowledge, but can learn high-quality scheduling policies via its own continuous ‘trial and error’. We introduce a new kernel-based neural network structure and trajectory filtering mechanism in RLScheduler to improve and stabilize the learning process. Through extensive evaluations, we confirm that RLScheduler can learn high-quality scheduling policies towards various workloads and various optimization goals with relatively low computation cost. Moreover, we show that the learned models perform stably even when applied to unseen workloads, making them practical for production use.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124430479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41
Scalable Heterogeneous Execution of a Coupled-Cluster Model with Perturbative Triples
Jinsung Kim, Ajay Panyala, B. Peng, K. Kowalski, P. Sadayappan, S. Krishnamoorthy
The CCSD(T) coupled-cluster model with perturbative triples is considered a gold standard for computational modeling of the correlated behavior of electrons in molecular systems. A fundamental constraint is the relatively small global-memory capacity of GPUs compared to the main-memory capacity of host nodes, necessitating relatively small tile sizes for the high-dimensional tensor contractions in NWChem's GPU-accelerated implementation of the CCSD(T) method. A coordinated redesign is described to address this limitation and the associated data-movement overheads, including a novel fused GPU kernel for a set of tensor contractions, along with inter-node communication optimization and data caching. The new implementation of GPU-accelerated CCSD(T) improves overall performance by $3.4\times$. Finally, we discuss the trade-offs in using this fused algorithm on current and future supercomputing platforms.
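The memory argument behind kernel fusion is easy to demonstrate with einsum: evaluating a chain of tensor contractions one at a time materializes a large intermediate, while a fused evaluation with a better contraction order never forms it. The numpy illustration below is generic; the actual CCSD(T) kernels and tensor shapes are far more involved.

```python
import numpy as np

# Two-step contraction: I[a,b,i,j] = sum_c T[a,b,c] * V[c,i,j], then
# R[a,b] = sum_{i,j} I[a,b,i,j] * W[i,j].  The unfused path materializes I
# (size a*b*i*j); the fused path contracts straight through to R (size a*b).
a = b = c = i = j = 32
T = np.random.rand(a, b, c)
V = np.random.rand(c, i, j)
W = np.random.rand(i, j)

# unfused: intermediate I holds 32^4 ~ 1M doubles (~8 MB) before reduction
I = np.einsum("abc,cij->abij", T, V)
R_unfused = np.einsum("abij,ij->ab", I, W)

# fused: one einsum call; with optimize=True numpy picks a contraction order
# (here V with W first, yielding only a length-c vector) that never stores I
R_fused = np.einsum("abc,cij,ij->ab", T, V, W, optimize=True)

print(np.allclose(R_unfused, R_fused))   # True
```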
{"title":"Scalable Heterogeneous Execution of a Coupled-Cluster Model with Perturbative Triples","authors":"Jinsung Kim, Ajay Panyala, B. Peng, K. Kowalski, P. Sadayappan, S. Krishnamoorthy","doi":"10.1109/SC41405.2020.00083","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00083","url":null,"abstract":"The CCSD(T) coupled-cluster model with perturbative triples is considered a gold standard for computational modeling of the correlated behavior of electrons in molecular systems. A fundamental constraint is the relatively small global-memory capacity in GPUs compared to the main-memory capacity on host nodes, necessitating relatively smaller tile sizes for high-dimensional tensor contractions in NWChem’s GPU-accelerated implementation of the CCSD(T) method. A coordinated redesign is described to address this limitation and associated data movement overheads, including a novel fused GPU kernel for a set of tensor contractions, along with inter-node communication optimization and data caching. The new implementation of GPU-accelerated CCSD(T) improves overall performance by $3.4 times$. Finally, we discuss the trade-offs in using this fused algorithm on current and future supercomputing platforms.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122041195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training
Arpan Jain, A. Awan, Asmaa Aljuhani, J. Hashmi, Quentin G. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani
Data-parallelism has become an established paradigm to train DNNs that fit inside GPU memory on large-scale HPC systems. However, model-parallelism is required to train out-of-core DNNs. In this paper, we deal with emerging requirements brought forward by very large DNNs being trained using the high-resolution images common in digital pathology. To address these, we propose, design, and implement GEMS, a GPU-Enabled Memory-Aware Model-Parallelism System. We present several design schemes, such as GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid, that offer excellent speedups over state-of-the-art systems like Mesh-TensorFlow and FlexFlow. Furthermore, we combine model-parallelism and data-parallelism to train a 1000-layer ResNet-1k model using 1,024 Volta V100 GPUs with 97.32% scaling efficiency. For a real-world histopathology whole-slide image (WSI) of 100,000 x 100,000 pixels, we train a custom ResNet-110-v2 on image tiles of size 1024 x 1024 and reduce the training time from seven hours to 28 minutes.
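The baseline mechanism GEMS builds on, splitting a model's layers across GPUs so each device holds only part of the parameters and activations, looks like the PyTorch sketch below. It assumes a machine with two CUDA devices and shows a generic two-stage split, not GEMS's memory-aware schedule.

```python
import torch
import torch.nn as nn

# Naive two-stage model parallelism: stage 1 lives on GPU 0, stage 2 on GPU 1.
# Activations are the only tensors that cross the device boundary, which is
# what makes out-of-core models trainable at all.
class TwoStageNet(nn.Module):
    def __init__(self, width=1024, n_classes=10):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        ).to("cuda:0")
        self.stage2 = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_classes),
        ).to("cuda:1")

    def forward(self, x):
        h = self.stage1(x.to("cuda:0"))
        return self.stage2(h.to("cuda:1"))   # ship activations, not weights

model = TwoStageNet()
out = model(torch.randn(8, 1024))
out.sum().backward()   # autograd routes gradients back across devices
```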
{"title":"GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training","authors":"Arpan Jain, A. Awan, Asmaa Aljuhani, J. Hashmi, Quentin G. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani","doi":"10.1109/SC41405.2020.00049","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00049","url":null,"abstract":"Data-parallelism has become an established paradigm to train DNNs that fit inside GPU memory on large-scale HPC systems. However, model-parallelism is required to train out-of-core DNNs. In this paper, we deal with emerging requirements brought forward by very large DNNs being trained using high-resolution images common in digital pathology. To address these, we propose, design, and implement GEMS; a GPU-Enabled Memory-Aware Model-Parallelism System. We present several design schemes like GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid that offer excellent speedups over state-of-the-art systems like Mesh-TensorFlow and FlexFlow. Furthermore, we combine model-parallelism and data-parallelism to train a 1000-1ayer ResNet-lk model using 1,024 Volta V100 GPUs with 97.32% scaling-efficiency. For the real-world histopathology whole-slide-image (WSI) of 100,000 x 100,000 pixels, we train custom ResNet-110-v2 on image tiles of size 1024 x 1024 and reduce the training time from seven hours to 28 minutes.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131719451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
Foresight: Analysis That Matters for Data Reduction
Pascal Grosset, C. Biwer, Jesus Pulido, A. Mohan, Ayan Biswas, J. Patchett, Terece L. Turton, D. Rogers, D. Livescu, J. Ahrens
As the computation power of supercomputers increases, so does simulation size, which in turn produces orders of magnitude more data. Because generated data often exceed the simulation's disk quota, many simulations would stand to benefit from data-reduction techniques to reduce storage requirements. Such techniques include autoencoders, data compression algorithms, and sampling. Lossy compression techniques can significantly reduce data size, but they come at the expense of losing information, which could result in incorrect post hoc analysis results. To help scientists determine the best compression they can get while keeping their analyses accurate, we have developed Foresight, an analysis framework that enables users to evaluate how different data-reduction techniques will impact their analyses. We use particle data from a cosmology simulation, turbulence data from Direct Numerical Simulation, and asteroid impact data from xRage to demonstrate how Foresight can help scientists determine the best data-reduction technique for their simulations.
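The evaluation loop such a framework automates, sweeping a family of reduction settings and measuring both the size saved and the error an analysis would observe, can be mimicked in a few lines. The Python sketch below uses mantissa truncation plus zlib as a stand-in lossy compressor; it is our toy, not one of Foresight's actual codecs.

```python
import numpy as np, zlib

# Sweep a toy lossy compressor (zero the low mantissa bits, then zlib) and
# report compression ratio vs. the error a post hoc analysis would observe.
def truncate_mantissa(x, keep_bits):
    bits = x.view(np.uint64)
    mask = np.uint64(~((1 << (52 - keep_bits)) - 1) & 0xFFFFFFFFFFFFFFFF)
    return (bits & mask).view(np.float64)

data = np.random.default_rng(0).standard_normal(100_000)
raw = data.tobytes()
for keep in (52, 24, 16, 8):   # keep=52 is the lossless baseline
    lossy = truncate_mantissa(data, keep)
    ratio = len(raw) / len(zlib.compress(lossy.tobytes()))
    max_rel = np.max(np.abs(lossy - data) / np.maximum(np.abs(data), 1e-300))
    print(f"keep {keep:2d} mantissa bits: ratio {ratio:5.2f}x, "
          f"max relative error {max_rel:.2e}")
```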
{"title":"Foresight: Analysis That Matters for Data Reduction","authors":"Pascal Grosset, C. Biwer, Jesus Pulido, A. Mohan, Ayan Biswas, J. Patchett, Terece L. Turton, D. Rogers, D. Livescu, J. Ahrens","doi":"10.1109/SC41405.2020.00087","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00087","url":null,"abstract":"As the computation power of supercomputers increases, so does simulation size, which in turn produces orders-of-magnitude more data. Because generated data often exceed the simulation’s disk quota, many simulations would stand to benefit from data-reduction techniques to reduce storage requirements. Such techniques include autoencoders, data compression algorithms, and sampling. Lossy compression techniques can significantly reduce data size, but such techniques come at the expense of losing information that could result in incorrect post hoc analysis results. To help scientists determine the best compression they can get while keeping their analyses accurate, we have developed Foresight, an analysis framework that enables users to evaluate how different data-reduction techniques will impact their analyses. We use particle data from a cosmology simulation, turbulence data from Direct Numerical Simulation, and asteroid impact data from xRage to demonstrate how Foresight can help scientists determine the best data-reduction technique for their simulations.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130418725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19