
SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (Latest Publications)

[Copyright notice]
{"title":"[Copyright notice]","authors":"","doi":"10.1109/sc41405.2020.00002","DOIUrl":"https://doi.org/10.1109/sc41405.2020.00002","url":null,"abstract":"","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129515017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures
Süreyya Emre Kurt, Aravind Sukumaran-Rajam, F. Rastello, P. Sadayappan
Tiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matrix computations. The proposed model-driven approach to sparse tiling is evaluated on two key sparse matrix kernels: Sparse Matrix - Dense Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM). Experimental results demonstrate that model-based tiled SpMM and SDDMM achieve high performance relative to the current state-of-the-art.
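To make the signature idea concrete, here is a minimal Python sketch: it condenses a CSR matrix's column structure into a one-dimensional histogram and feeds a toy traffic model to pick a tile width. The binning scheme and the cost formula are illustrative assumptions, not the paper's actual signature or cost model.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def column_signature(A_csr, num_bins=64):
    """1-D signature: nonzero count per column bin of the sparse matrix."""
    n_cols = A_csr.shape[1]
    bins = np.zeros(num_bins, dtype=np.int64)
    bin_of_nnz = np.minimum(A_csr.indices * num_bins // n_cols, num_bins - 1)
    np.add.at(bins, bin_of_nnz, 1)
    return bins

def tile_traffic(signature, tile_bins, dense_cols, cache_rows=16):
    """Toy cost model for tiled SpMM (C = A @ B): a wider column tile amortizes
    partial-C traffic over more of A, but once the tile's working set of B rows
    exceeds the cache, those rows start getting reloaded."""
    cost = 0
    for start in range(0, len(signature), tile_bins):
        chunk = signature[start:start + tile_bins]
        touched = np.count_nonzero(chunk)         # distinct B-row groups used
        reloads = max(1, touched / cache_rows)    # spill factor past the cache
        cost += touched * dense_cols * reloads    # B traffic for this tile
        cost += dense_cols                        # partial-C flush per tile
    return cost

A = sparse_random(4096, 4096, density=0.001, format="csr", random_state=0)
sig = column_signature(A)
best = min((tile_traffic(sig, t, dense_cols=128), t) for t in (4, 8, 16, 32))
print("chosen tile width (in signature bins):", best[1])
```

The point of the signature is that this search happens over a 64-entry vector instead of the full sparsity pattern, so tile-size selection stays cheap.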
Citations: 16
GVPROF: A Value Profiler for GPU-Based Clusters
K. Zhou, Yueming Hao, J. Mellor-Crummey, Xiaozhu Meng, Xu Liu
GPGPUs are widely used in high-performance computing systems to accelerate scientific and machine learning workloads. Developing efficient GPU kernels is critically important to obtain “bare-metal” performance on GPU-based clusters. In this paper, we describe the design and implementation of GVPROF, the first value profiler that pinpoints value-related inefficiencies in applications running on NVIDIA GPU-based clusters. The novelty of GVPROF resides in its ability to detect temporal and spatial value redundancies, which provides useful information to guide code optimization. GVPROF can monitor production multi-node multi-GPU executions in clusters. Our experiments with well-known GPU benchmarks and HPC applications show that GVPROF incurs acceptable overhead and scales to large executions. Using GVPROF, we optimized several HPC and machine learning workloads on one NVIDIA V100 GPU. In one case study of LAMMPS, optimizations based on information from GVPROF led to whole-program speedups ranging from 1.37x on a single GPU to 1.08x on 64 GPUs.
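As a minimal illustration of what "temporal" and "spatial" value redundancy mean, the Python sketch below scans a recorded trace of (address, value) stores. The trace format is an assumption made for illustration; GVPROF itself instruments GPU binaries to collect such values rather than consuming a prebuilt trace.

```python
def value_redundancy(store_trace):
    """store_trace: iterable of (address, value) pairs recorded from stores.
    Returns the fraction of temporally and spatially redundant stores."""
    last_value = {}            # address -> value it currently holds
    prev_store_value = None    # value written by the immediately preceding store
    temporal = spatial = total = 0
    for addr, val in store_trace:
        total += 1
        if last_value.get(addr) == val:
            temporal += 1      # rewriting a location with the value it holds
        if prev_store_value == val:
            spatial += 1       # same value as a neighboring store
        last_value[addr] = val
        prev_store_value = val
    return temporal / total, spatial / total

trace = [(0x100, 0.0), (0x104, 0.0), (0x100, 0.0), (0x108, 1.0)]
t, s = value_redundancy(trace)
print(f"temporal: {t:.0%}, spatial: {s:.0%}")   # temporal: 25%, spatial: 50%
```

Stores flagged this way are candidates for optimization: a fully redundant write can often be skipped or hoisted out of a loop.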
Citations: 16
RDMP-KV: Designing Remote Direct Memory Persistence based Key-Value Stores with PMEM
Tianxi Li, D. Shankar, Shashank Gugnani, Xiaoyi Lu
Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote Direct Memory Access (RDMA) capable networks. However, existing studies that combine RDMA and PMEM cannot deliver the desired performance due to their PMEM-oblivious communication protocols. In this paper, we propose novel PMEM-aware RDMA-based communication protocols for persistent key-value stores, referred to as Remote Direct Memory Persistence based Key-Value stores (RDMP-KV). RDMP-KV employs a hybrid ‘server-reply/server-bypass’ approach to ‘durably’ store individual key-value objects on PMEM-equipped servers. RDMP-KV’s runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability through a lightweight consistency scheme. Performance evaluations show that RDMP-KV can improve the server-side performance with different persistent key-value storage architectures by up to 22x, as compared with PMEM-oblivious RDMA-‘Server-Reply’ protocols. Our evaluations also show that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65% and a recent RDMA-to-PMEM framework by up to 71%.
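The sketch below models the hybrid PUT path in plain Python. The size threshold, the method names, and the read-after-write flush are illustrative stand-ins for RDMP-KV's actual RDMA-verbs protocol over PMEM, shown only to make the server-reply versus server-bypass split concrete.

```python
SMALL_VALUE_BYTES = 64          # assumed threshold favoring server-reply

class PMEMServer:
    """Simulated server; self.store stands in for a PMEM region."""
    def __init__(self):
        self.store = {}

    def put_with_reply(self, key, value):
        self.store[key] = value   # server CPU persists, then acknowledges
        return "ACK"

    def rdma_write(self, key, value):
        self.store[key] = value   # one-sided write; server CPU is bypassed

    def rdma_read(self, key):
        # A read following a write forces the written data out of the NIC
        # into the persistence domain (the "appliance durability" pattern).
        return self.store[key]

def put(server, key, value):
    if len(value) <= SMALL_VALUE_BYTES:
        return server.put_with_reply(key, value)   # small object: server-reply
    server.rdma_write(key, value)                  # large object: server-bypass
    server.rdma_read(key)                          # durability flush
    return "ACK"

srv = PMEMServer()
print(put(srv, "k1", b"x" * 16), put(srv, "k2", b"y" * 4096))
```

The design intuition is that small objects do not repay the extra round trip of a one-sided flush, while large objects benefit from bypassing the server CPU entirely.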
Citations: 2
An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems
Shaoqi Wang, O. J. Gonzalez, Xiaobo Zhou, Thomas Williams, Brian D. Friedman, Martin Havemann, Thomas Y. C. Woo
Efficient GPU scheduling is the key to minimizing the execution time of Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs, or requires modification to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a “SideCar” process that temporarily stops and restarts the job’s DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
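A minimal sketch of the adaptive allocation step: GPUs are handed out greedily to the job whose measured throughput curve shows the largest marginal gain. The throughput-curve representation and the greedy rule are assumptions made for illustration; the paper's scheduler additionally moves GPUs between running jobs via the SideCar mechanism, which this sketch does not model.

```python
def schedule(jobs, total_gpus):
    """Greedy sketch: hand out GPUs one at a time to the job with the best
    marginal gain in training throughput. 'jobs' maps job id -> throughput
    curve, where curve[i] is the measured throughput with i+1 GPUs."""
    alloc = {j: 1 for j in jobs}                 # every job gets one GPU first
    free = total_gpus - len(jobs)
    while free > 0:
        def gain(j):
            curve, n = jobs[j], alloc[j]
            hi = curve[min(n, len(curve) - 1)]
            lo = curve[min(n - 1, len(curve) - 1)]
            return hi - lo                       # extra throughput of one more GPU
        best = max(alloc, key=gain)
        if gain(best) <= 0:
            break                                # no job benefits from more GPUs
        alloc[best] += 1
        free -= 1
    return alloc

# throughput curves: index i = throughput with i+1 GPUs (sub-linear scaling)
jobs = {"resnet": [100, 190, 260, 300], "bert": [80, 150, 170, 175]}
print(schedule(jobs, total_gpus=6))   # -> {'resnet': 4, 'bert': 2}
```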
Citations: 16
Improving All-to-Many Personalized Communication in Two-Phase I/O
Qiao Kang, R. Ross, R. Latham, Sunwoo Lee, Ankit Agrawal, A. Choudhary, W. Liao
As modern parallel computers enter the exascale era, the communication cost for redistributing requests becomes a significant bottleneck in MPI-IO routines. The communication kernel for request redistribution, which has an all-to-many personalized communication pattern for application programs with a large number of noncontiguous requests, plays an essential role in the overall performance. This paper explores the available communication kernels for two-phase I/O communication. We generalize the spread-out algorithm to adapt to the all-to-many communication pattern of two-phase I/O by reducing the communication straggler effect. Communication throttling methods that reduce communication contention for asynchronous MPI implementations are adopted to improve communication performance further. Experimental results are presented using different communication kernels running on the Cray XC40 Cori and IBM AC922 Summit supercomputers with different I/O patterns. Our study shows that adjusting communication kernel algorithms for different I/O patterns can improve the end-to-end performance up to 10 times compared with default MPI-IO implementations.
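A small sketch of the spread-out idea: pairing sender p with receiver (p + step) mod n ensures every receiver handles at most one message per step, and batching steps caps how many asynchronous sends are outstanding at once (communication throttling). Both rules are classic formulations assumed here for illustration, not the paper's exact generalized algorithm.

```python
def spread_out_schedule(nprocs, throttle=4):
    """Step s pairs sender p with receiver (p + s) % nprocs, so no receiver
    is hit twice in a step (avoiding a straggler hot-spot). 'throttle' caps
    how many steps' asynchronous sends are posted at once."""
    steps = [[(p, (p + s) % nprocs) for p in range(nprocs)]
             for s in range(1, nprocs)]          # step 0 would be self-sends
    return [steps[i:i + throttle] for i in range(0, len(steps), throttle)]

for batch in spread_out_schedule(4, throttle=2):
    print(batch)   # post this batch's sends, wait for completion, continue
```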
Citations: 8
Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification, and Implications
Tirthak Patel, Zhengchun Liu, R. Kettimuthu, P. Rich, W. Allcock, Devesh Tiwari
HPC workload analysis and resource consumption characteristics are the key to driving better operation practices, system procurement decisions, and the design of effective resource management techniques. Unfortunately, the HPC community does not have easy access to long-term introspective workload analysis and characterization for production-scale HPC systems. This study bridges this gap by providing detailed long-term quantification, characterization, and analysis of job characteristics on two supercomputers: Intrepid and Mira. This study is one of the largest of its kind, covering trends and characteristics for over three billion compute hours and 750 thousand jobs, and spanning a decade. We confirm several pieces of long-held conventional wisdom, and identify many previously undiscovered trends and their implications. We also introduce a learning-based technique to predict the resource requirements of future jobs with high accuracy, using features available prior to job submission and without requiring any application-specific tracing or application-intrusive instrumentation.
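A hedged sketch of that learning-based prediction, on synthetic data with scikit-learn. The feature set (requested nodes, requested walltime, the user's mean past runtime, hour of day) and the synthetic target are assumptions for illustration; the paper's actual features come from scheduler logs available at submission time.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Columns (assumed, for illustration): requested nodes, requested walltime,
# user's mean past runtime, hour of day -- all known before the job starts.
rng = np.random.default_rng(0)
X = rng.random((5000, 4))
y = 0.6 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 0.05, 5000)  # stand-in for
                                                               # actual usage
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:4000], y[:4000])                  # train on "past" jobs
pred = model.predict(X[4000:])                 # predict "future" jobs
print("mean abs error:", np.mean(np.abs(pred - y[4000:])))
```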
Citations: 35
FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short
Maciej Besta, Marcel Schneider, Marek Konieczny, Karolina Cynk, Erik Henriksson, S. D. Girolamo, Ankit Singla, T. Hoefler
We introduce FatPaths: a simple, generic, and robust routing architecture that enables state-of-the-art low-diameter topologies such as Slim Fly to achieve unprecedented performance. FatPaths targets Ethernet stacks in HPC supercomputers as well as cloud data centers and clusters. FatPaths exposes and exploits the rich (“fat”) diversity of both minimal and non-minimal paths for high-performance multi-pathing. Moreover, FatPaths uses a redesigned “purified” transport layer that removes virtually all TCP performance issues (e.g., the slow start), and incorporates flowlet switching, a technique used to prevent packet reordering in TCP networks, to enable very simple and effective load balancing. Our design enables recent low-diameter topologies to outperform powerful Clos designs, achieving 15% higher net throughput at 2x lower latency for comparable cost. FatPaths will significantly accelerate Ethernet clusters that form more than 50% of the Top500 list, and it may become a standard routing scheme for modern topologies. Extended paper version: https://arxiv.org/abs/1906.10885
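The sketch below shows the flowlet mechanic: a flow keeps its routing layer (a subgraph with its own set of minimal or non-minimal paths) while packets arrive back-to-back, and may be re-hashed to another layer only after an idle gap, so in-flight packets are not reordered. The gap threshold, layer count, and hash are illustrative assumptions, not FatPaths' configured values.

```python
import zlib

FLOWLET_GAP = 50e-6   # seconds of idle that starts a new flowlet (assumed)
NUM_LAYERS = 8        # routing layers, each with its own path set (assumed)

class FlowletRouter:
    def __init__(self):
        self.state = {}   # flow id -> (last packet time, flowlet counter)

    def layer_for(self, flow, now):
        last, count = self.state.get(flow, (None, 0))
        if last is not None and now - last > FLOWLET_GAP:
            count += 1                       # gap seen: safe to switch paths
        self.state[flow] = (now, count)      # same flowlet keeps its layer,
                                             # so packets stay in order
        return zlib.crc32(f"{flow}:{count}".encode()) % NUM_LAYERS

r = FlowletRouter()
print(r.layer_for(("10.0.0.1", "10.0.0.2", 4242), 0.0))
print(r.layer_for(("10.0.0.1", "10.0.0.2", 4242), 1e-4))  # gap: may switch
```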
Citations: 14
Alita: Comprehensive Performance Isolation through Bias Resource Management for Public Clouds
Quan Chen, Shuai Xue, Shang Zhao, Shanpei Chen, Yihao Wu, Yu Xu, Zhuo Song, Tao Ma, Yong Yang, M. Guo
The tenants of public cloud platforms share hardware resources on the same node, resulting in the potential for performance interference (or malicious attacks). A tenant is able to significantly degrade the performance of its neighbors on the same node through overuse of the shared memory bus, last-level cache (LLC)/memory bandwidth, and power. To eliminate such unfairness, we propose Alita, a runtime system consisting of an online interference identifier and an adaptive interference eliminator. The interference identifier monitors hardware and system-level event statistics to identify resource polluters. The eliminator improves the performance of normal applications by throttling only the resource usage of polluters. Specifically, Alita adopts bus-lock sparsification, bias LLC/bandwidth isolation, and selective power throttling to throttle the resource usage of polluters. Results for an experimental platform and an in-production cloud platform with 30,000 nodes demonstrate that Alita significantly improves the performance of co-located virtual machines in the presence of resource polluters based on system-level knowledge.
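A minimal sketch of the identify-then-throttle loop. The event names and thresholds are assumptions standing in for the hardware and system-level statistics Alita actually monitors, and the throttle callback stands in for its bias isolation actions (smaller LLC/bandwidth shares, power caps) applied to polluters only.

```python
# Assumed per-tenant event-rate thresholds (illustrative, not Alita's values).
THRESHOLDS = {"bus_locks_per_s": 1e4, "llc_misses_per_s": 5e7}

def find_polluters(samples):
    """samples: tenant id -> dict of per-second event rates."""
    return [t for t, ev in samples.items()
            if any(ev.get(k, 0) > v for k, v in THRESHOLDS.items())]

def rebalance(samples, apply_throttle):
    for tenant in find_polluters(samples):
        apply_throttle(tenant)    # shrink only the polluter's resource shares

samples = {
    "vm-1": {"bus_locks_per_s": 2e4, "llc_misses_per_s": 1e6},  # bus-lock abuser
    "vm-2": {"bus_locks_per_s": 10,  "llc_misses_per_s": 1e6},  # normal tenant
}
rebalance(samples, apply_throttle=lambda t: print("throttling", t))
```

Throttling only flagged tenants is what makes the scheme "bias" resource management: well-behaved neighbors keep their full shares.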
Citations: 16
Cost-Aware Prediction of Uncorrected DRAM Errors in the Field
Isaac Boixaderas, D. Zivanovic, Sergi Moré, Javier Bartolome, David Vicente, Marc Casas, P. Carpenter, Petar Radojkovic, E. Ayguadé
This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production use of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node-hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost-benefit analysis. This methodology can help ensure that any DRAM error predictor is free from training bias and has a clear cost-benefit calculation.
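The cost-benefit view can be made concrete with a few lines of arithmetic: a true positive saves a failure's lost node-hours minus the mitigation cost, while a false positive pays only the mitigation cost. The constants below are illustrative assumptions, not the paper's measured values.

```python
def net_saving(true_pos, false_pos,
               hours_lost_per_failure=24.0, hours_per_mitigation=0.5):
    """Net node-hours saved by acting on predictions. A true positive avoids
    a failure but still pays the mitigation cost; a false positive only pays
    the cost; false negatives change nothing relative to no predictor."""
    saved = true_pos * (hours_lost_per_failure - hours_per_mitigation)
    wasted = false_pos * hours_per_mitigation
    return saved - wasted

# High-precision predictor vs. high-recall predictor on the same error stream:
print(net_saving(true_pos=900,  false_pos=100))    # 21100.0 node-hours
print(net_saving(true_pos=1100, false_pos=4000))   # 23850.0 node-hours
```

Under these assumed costs, the lower-precision, higher-recall predictor comes out ahead, which is exactly why precision and recall alone cannot rank predictors.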
Citations: 15