Proceedings of the 48th International Conference on Parallel Processing最新文献_第7页

Reducing Kernel Surface Areas for Isolation and Scalability 减少核表面积以实现隔离和可伸缩性

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337900

Daniel Zahka, Brian Kocoloski, Katarzyna Keahey

Isolation is a desirable property for applications executing in multi-tenant computing systems. On the performance side, hardware resource isolation via partitioning mechanisms is commonly applied to achieve QoS, a necessary property for many noise-sensitive parallel workloads. Conversely, on the software side, partitioning is used, usually in the form of virtual machines, to provide secure environments with smaller attack surfaces than those present in shared software stacks. In this paper, we identify a further benefit from isolation, one that is currently less appreciated in most parallel computing settings: isolation of system software stacks, including OS kernels, can lead to significant performance benefits through a reduction in variability. To highlight the existing problem in shared software stacks, we first developed a new systematic approach to measure and characterize latent sources of variability in the Linux kernel. Using this approach, we find that hardware VMs are effective substrates for limiting kernel-level interference that otherwise occurs in monolithic kernel systems. Furthermore, by enabling reductions in variability, we find that virtualized environments often have superior worst-case performance characteristics than native or containerized environments. Finally, we demonstrate that due to their isolated software contexts, most virtualized applications consistently outperform their bare-metal counterparts when executing on 64-nodes of a multi-tenant, kernel-intensive cloud system.

隔离是在多租户计算系统中执行的应用程序的理想属性。在性能方面，通过分区机制进行硬件资源隔离通常用于实现QoS，这是许多对噪声敏感的并行工作负载的必要属性。相反，在软件方面，通常以虚拟机的形式使用分区，以提供比共享软件堆栈中存在的攻击面更小的安全环境。在本文中，我们确定了隔离的另一个好处，一个目前在大多数并行计算设置中不太受重视的好处:系统软件堆栈(包括操作系统内核)的隔离可以通过减少可变性来带来显著的性能优势。为了突出共享软件栈中存在的问题，我们首先开发了一种新的系统方法来测量和描述Linux内核中潜在的可变性来源。使用这种方法，我们发现硬件vm是限制内核级干扰的有效基础，否则会发生在单片内核系统中。此外，通过减少可变性，我们发现虚拟化环境通常比本地环境或容器化环境具有更好的最坏情况性能特征。最后，我们将证明，由于它们的软件上下文是隔离的，大多数虚拟化应用程序在多租户、内核密集型云系统的64个节点上执行时，始终优于裸机应用程序。

{"title":"Reducing Kernel Surface Areas for Isolation and Scalability","authors":"Daniel Zahka, Brian Kocoloski, Katarzyna Keahey","doi":"10.1145/3337821.3337900","DOIUrl":"https://doi.org/10.1145/3337821.3337900","url":null,"abstract":"Isolation is a desirable property for applications executing in multi-tenant computing systems. On the performance side, hardware resource isolation via partitioning mechanisms is commonly applied to achieve QoS, a necessary property for many noise-sensitive parallel workloads. Conversely, on the software side, partitioning is used, usually in the form of virtual machines, to provide secure environments with smaller attack surfaces than those present in shared software stacks. In this paper, we identify a further benefit from isolation, one that is currently less appreciated in most parallel computing settings: isolation of system software stacks, including OS kernels, can lead to significant performance benefits through a reduction in variability. To highlight the existing problem in shared software stacks, we first developed a new systematic approach to measure and characterize latent sources of variability in the Linux kernel. Using this approach, we find that hardware VMs are effective substrates for limiting kernel-level interference that otherwise occurs in monolithic kernel systems. Furthermore, by enabling reductions in variability, we find that virtualized environments often have superior worst-case performance characteristics than native or containerized environments. Finally, we demonstrate that due to their isolated software contexts, most virtualized applications consistently outperform their bare-metal counterparts when executing on 64-nodes of a multi-tenant, kernel-intensive cloud system.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132982007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

diBELLA: Distributed Long Read to Long Read Alignment diBELLA:分布式长读对长读对齐

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337919

Marquita Ellis, Giulia Guidi, A. Buluç, L. Oliker, K. Yelick

We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from "third generation" long read sequencers [29]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different approaches to analysis than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and present analyses for high level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as a commercial cloud architectures, highlighting the advantages of state-of-the-art network technologies.

我们提出了一种用于基因组分析的并行算法和可扩展实现，特别是发现来自“第三代”长读测序仪[29]的数据重叠和对齐的问题。虽然长DNA序列为生物分析和洞察提供了巨大的优势，但目前的长读测序仪器有很高的错误率，因此需要与短读测序仪器不同的分析方法。我们的工作重点是一个精确的单节点算法的高效分布式内存并行化，用于重叠和对齐长读取。我们通过解决增加并行性、最小化通信、限制内存占用和确保良好负载平衡等竞争问题来实现这种不规则算法的可伸缩性。由此产生的应用程序diBELLA是第一个专门为长读取和并行可伸缩性设计的分布式内存重叠和对齐器。我们对高级设计权衡进行了描述和分析，并进行了广泛的实证分析，比较了最先进的高性能计算系统和商业云架构的性能特征，突出了最先进的网络技术的优势。

{"title":"diBELLA: Distributed Long Read to Long Read Alignment","authors":"Marquita Ellis, Giulia Guidi, A. Buluç, L. Oliker, K. Yelick","doi":"10.1145/3337821.3337919","DOIUrl":"https://doi.org/10.1145/3337821.3337919","url":null,"abstract":"We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from \"third generation\" long read sequencers [29]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different approaches to analysis than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and present analyses for high level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as a commercial cloud architectures, highlighting the advantages of state-of-the-art network technologies.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114258890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

Cooperative Job Scheduling and Data Allocation for Busy Data-Intensive Parallel Computing Clusters 繁忙数据密集型并行计算集群的协同作业调度与数据分配

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337864

Guoxin Liu, Haiying Shen, Haoyu Wang

In data-intensive parallel computing clusters, it is important to provide deadline-guaranteed service to jobs while minimizing resource usage (e.g., network bandwidth and energy). Under the current computing framework (that first allocates data and then schedules jobs), in a busy cluster with many jobs, it is difficult to achieve these objectives simultaneously. We model the problem to simultaneously achieve the objectives using integer programming, and propose a heuristic Cooperative job Scheduling and data Allocation method (CSA). CSA novelly reverses the order of data allocation and job scheduling in the current computing framework, i.e., changing data-first-job-second to job-first-data-second. It enables CSA to proactively consolidate tasks with more common requested data to the same server when conducting deadline-aware scheduling, and also consolidate the tasks to as few servers as possible to maximize energy savings. This facilitates the subsequent data allocation step to allocate a data block to the server that hosts most of this data's requester tasks, thus maximally enhancing data locality and reduce bandwidth consumption. CSA also has a recursive schedule refinement process to adjust the job and data allocation schedules to improve system performance regarding the three objectives and achieve the tradeoff between data locality and energy savings with specified weights. We implemented CSA and a number of previous job schedulers on Apache Hadoop on a real supercomputing cluster. Trace-driven experiments in the simulation and the real cluster show that CSA outperforms other schedulers in supplying deadline-guarantee and resource-efficient services.

在数据密集型并行计算集群中，重要的是在最小化资源使用(例如，网络带宽和能源)的同时为作业提供有期限保证的服务。在当前的计算框架下(先分配数据后调度作业)，在一个作业较多的繁忙集群中，很难同时实现这些目标。采用整数规划方法对问题进行建模，并提出了一种启发式协同作业调度和数据分配方法。CSA将当前计算框架中数据分配和作业调度的顺序进行了新颖的反转，将数据优先-作业-秒改为作业优先-数据-秒。它使CSA能够在执行截止日期感知调度时，主动地将具有更多常见请求数据的任务合并到同一台服务器上，并且还可以将任务合并到尽可能少的服务器上，以最大限度地节省能源。这有助于后续的数据分配步骤，将数据块分配给承载该数据的大多数请求者任务的服务器，从而最大限度地增强数据局部性并减少带宽消耗。CSA还有一个递归调度优化过程，用于调整作业和数据分配调度，以提高关于这三个目标的系统性能，并在指定权重下实现数据局部性和节能之间的权衡。我们在一个真正的超级计算集群上，在Apache Hadoop上实现了CSA和许多以前的作业调度器。仿真和实际集群的跟踪驱动实验表明，CSA在提供截止日期保证和资源高效服务方面优于其他调度程序。

{"title":"Cooperative Job Scheduling and Data Allocation for Busy Data-Intensive Parallel Computing Clusters","authors":"Guoxin Liu, Haiying Shen, Haoyu Wang","doi":"10.1145/3337821.3337864","DOIUrl":"https://doi.org/10.1145/3337821.3337864","url":null,"abstract":"In data-intensive parallel computing clusters, it is important to provide deadline-guaranteed service to jobs while minimizing resource usage (e.g., network bandwidth and energy). Under the current computing framework (that first allocates data and then schedules jobs), in a busy cluster with many jobs, it is difficult to achieve these objectives simultaneously. We model the problem to simultaneously achieve the objectives using integer programming, and propose a heuristic Cooperative job Scheduling and data Allocation method (CSA). CSA novelly reverses the order of data allocation and job scheduling in the current computing framework, i.e., changing data-first-job-second to job-first-data-second. It enables CSA to proactively consolidate tasks with more common requested data to the same server when conducting deadline-aware scheduling, and also consolidate the tasks to as few servers as possible to maximize energy savings. This facilitates the subsequent data allocation step to allocate a data block to the server that hosts most of this data's requester tasks, thus maximally enhancing data locality and reduce bandwidth consumption. CSA also has a recursive schedule refinement process to adjust the job and data allocation schedules to improve system performance regarding the three objectives and achieve the tradeoff between data locality and energy savings with specified weights. We implemented CSA and a number of previous job schedulers on Apache Hadoop on a real supercomputing cluster. Trace-driven experiments in the simulation and the real cluster show that CSA outperforms other schedulers in supplying deadline-guarantee and resource-efficient services.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"281 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123720947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

HPAS 下丘脑-垂体-肾上腺轴的

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337907

E. Ates, Yijia Zhang, Burak Aksar, Jim Brandt, V. Leung, Manuel Egele, A. Coskun

Modern high performance computing (HPC) systems, including supercomputers, routinely suffer from substantial performance variations. The same application with the same input can have more than 100% performance variation, and such variations cause reduced efficiency and wasted resources. There have been recent studies on performance variability and on designing automated methods for diagnosing "anomalies" that cause performance variability. These studies either observe data collected from HPC systems, or they rely on synthetic reproduction of performance variability scenarios. However, there is no standardized way of creating performance variability inducing synthetic anomalies; so, researchers rely on designing ad-hoc methods for reproducing performance variability. This paper addresses this lack of a common method for creating relevant performance anomalies by introducing HPAS, an HPC Performance Anomaly Suite, consisting of anomaly generators for the major subsystems in HPC systems. These easy-to-use synthetic anomaly generators facilitate low-effort evaluation and comparison of various analytics methods as well as performance or resilience of applications, middleware, or systems under realistic performance variability scenarios. The paper also provides an analysis of the behavior of the anomaly generators and demonstrates several use cases: (1) performance anomaly diagnosis using HPAS, (2) evaluation of resource management policies under performance variations, and (3) design of applications that are resilient to performance variability.

{"title":"HPAS","authors":"E. Ates, Yijia Zhang, Burak Aksar, Jim Brandt, V. Leung, Manuel Egele, A. Coskun","doi":"10.1145/3337821.3337907","DOIUrl":"https://doi.org/10.1145/3337821.3337907","url":null,"abstract":"Modern high performance computing (HPC) systems, including supercomputers, routinely suffer from substantial performance variations. The same application with the same input can have more than 100% performance variation, and such variations cause reduced efficiency and wasted resources. There have been recent studies on performance variability and on designing automated methods for diagnosing \"anomalies\" that cause performance variability. These studies either observe data collected from HPC systems, or they rely on synthetic reproduction of performance variability scenarios. However, there is no standardized way of creating performance variability inducing synthetic anomalies; so, researchers rely on designing ad-hoc methods for reproducing performance variability. This paper addresses this lack of a common method for creating relevant performance anomalies by introducing HPAS, an HPC Performance Anomaly Suite, consisting of anomaly generators for the major subsystems in HPC systems. These easy-to-use synthetic anomaly generators facilitate low-effort evaluation and comparison of various analytics methods as well as performance or resilience of applications, middleware, or systems under realistic performance variability scenarios. The paper also provides an analysis of the behavior of the anomaly generators and demonstrates several use cases: (1) performance anomaly diagnosis using HPAS, (2) evaluation of resource management policies under performance variations, and (3) design of applications that are resilient to performance variability.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114455563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 12

Design Exploration of Multi-tier Interconnection Networks for Exascale Systems 百亿亿级系统多层互连网络的设计探索

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337903

J. Navaridas, Joshua Lant, J. A. Pascual, M. Luján, J. Goodacre

Interconnection networks are one of the main limiting factors when it comes to scale out computing systems. In this paper, we explore what role the hybridization of topologies has on the design of an state-of-the-art exascale-capable computing system. More precisely we compare several hybrid topologies and compare with common single-topology ones when dealing with large-scale applicationlike traffic. In addition we explore how different aspects of the hybrid topology can affect the overall performance of the system. In particular, we found that hybrid topologies can outperform state-of-the-art torus and fattree networks as long as the density of connections is high enough--one connection every two or four nodes seems to be the sweet spot--and the size of the subtori is limited to a few nodes per dimension. Moreover, we explored two different alternatives to use in the upper tiers of the interconnect, a fattree and a generalised hypercube, and found little difference between the topologies, mostly depending on the workload to be executed.

当涉及到扩展计算系统时，互连网络是主要的限制因素之一。在本文中，我们探讨了拓扑的杂交在最先进的百亿亿级计算系统的设计中的作用。更准确地说，我们比较了几种混合拓扑，并在处理大规模应用程序流量时与常见的单一拓扑进行了比较。此外，我们还探讨了混合拓扑的不同方面如何影响系统的整体性能。特别是，我们发现，只要连接密度足够高(每两个或四个节点一个连接似乎是最佳点)，混合拓扑可以胜过最先进的环面和脂肪树网络，并且子图的大小被限制在每个维度的几个节点内。此外，我们还研究了在互连的上层使用的两种不同的替代方案，即胖树和广义超立方体，发现拓扑之间几乎没有区别，主要取决于要执行的工作负载。

引用次数: 1

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning 深度学习BeeGFS的I/O表征和性能评估

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337902

Fahim Chowdhury, Yue Zhu, T. Heer, Saul Paredes, A. Moody, R. Goldstone, K. Mohror, Weikuan Yu

Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage and scalable performance. Emerging Deep Learning (DL) applications incur new I/O and storage requirements to HPC systems with batched input of small random files. This mandates PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has grabbed the attention of the research and industry world because of its performance, scalability and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper, we present the architectural and system features of BeeGFS, and perform an experimental evaluation using cutting-edge I/O, Metadata and DL application benchmarks. Particularly, we have utilized AlexNet and ResNet-50 models for the classification of ImageNet dataset using the Livermore Big Artificial Neural Network Toolkit (LBANN), and ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides a useful documentation on how to leverage BeeGFS for the emerging DL applications.

并行文件系统(pfs)经常部署在高性能计算(HPC)系统上，以确保高效的I/O、持久的存储和可扩展的性能。新兴的深度学习(DL)应用程序会对HPC系统产生新的I/O和存储需求，并批量输入小型随机文件。这就要求pfs具有能够满足DL应用程序需求的相应特性。BeeGFS是最近出现的一种PFS，由于其性能、可扩展性和易用性而引起了研究和工业界的注意。在强调BeeGFS的系统性能分析的同时，本文介绍了BeeGFS的架构和系统特征，并使用先进的I/O，元数据和DL应用基准进行了实验评估。特别是，我们使用了AlexNet和ResNet-50模型对ImageNet数据集进行分类，使用了Livermore大人工神经网络工具包(LBANN)，以及基于TensorFlow和Horovod的ImageNet数据读取器管道。通过对BeeGFS进行广泛的性能表征，我们的研究为如何将BeeGFS用于新兴的深度学习应用程序提供了有用的文档。

{"title":"I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning","authors":"Fahim Chowdhury, Yue Zhu, T. Heer, Saul Paredes, A. Moody, R. Goldstone, K. Mohror, Weikuan Yu","doi":"10.1145/3337821.3337902","DOIUrl":"https://doi.org/10.1145/3337821.3337902","url":null,"abstract":"Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage and scalable performance. Emerging Deep Learning (DL) applications incur new I/O and storage requirements to HPC systems with batched input of small random files. This mandates PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has grabbed the attention of the research and industry world because of its performance, scalability and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper, we present the architectural and system features of BeeGFS, and perform an experimental evaluation using cutting-edge I/O, Metadata and DL application benchmarks. Particularly, we have utilized AlexNet and ResNet-50 models for the classification of ImageNet dataset using the Livermore Big Artificial Neural Network Toolkit (LBANN), and ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides a useful documentation on how to leverage BeeGFS for the emerging DL applications.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125847361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 33

A Tale of Two (Flow) Tables: Demystifying Rule Caching in OpenFlow Switches 两个(流)表的故事:揭秘OpenFlow交换机中的规则缓存

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337896

Rui Li, Yu Pang, Jin Zhao, Xin Wang

Software Defined Networking (SDN) enables flexible flow control by deploying fine-grained rules in OpenFlow switches. Modern commodity switches usually use TCAM to store these rules and perform high-speed parallel lookups. Though efficient, the TCAM capacity is limited because TCAM is expensive in cost and power-hungry. The explosive growth in the number of rules has exacerbated the limitation of TCAM. There have been considerable efforts in implementing hybrid flow tables with both TCAM and RAM, where the high-speed TCAM is regarded as a cache to store the most popular rules and the cheap RAM is used to handle cache miss. The primary challenges for designing hybrid TCAM/RAM flow tables lie in how to improve cache hit rate and how to handle wildcard rule dependency when allocating rules between TCAM and RAM. In this paper, we present the design and evaluation of CuCa, a practical and efficient rule caching scheme for hybrid switches. Different from existing schemes, CuCa offers both offline and online algorithms for rule caching, corresponding to the proactive and reactive approaches to OpenFlow rule installation. By designing a two-stage-cache architecture in TCAM, CuCa can handle rule dependency efficiently and provide remarkable performance improvements. Simulation and real-world experiment results reveal that CuCa improves average TCAM hit rate by 38.7% compared to state-of-the-art schemes and by over 33% compared to the default caching algorithm of a commodity OpenFlow switch.

软件定义网络(SDN)通过在OpenFlow交换机中部署细粒度规则来实现灵活的流量控制。现代商品交换机通常使用TCAM来存储这些规则并执行高速并行查找。虽然效率很高，但TCAM的容量有限，因为TCAM成本昂贵且耗电。规则数量的爆炸性增长加剧了TCAM的局限性。在实现TCAM和RAM混合流表方面已经做了相当多的努力，其中高速的TCAM作为缓存来存储最流行的规则，而廉价的RAM用于处理缓存缺失。设计混合TCAM/RAM流表的主要挑战在于如何提高缓存命中率以及如何在TCAM和RAM之间分配规则时处理通配符规则依赖。本文提出了一种实用高效的混合交换机规则缓存方案CuCa的设计与评价。与现有方案不同，CuCa为规则缓存提供了离线和在线算法，对应于OpenFlow规则安装的主动和被动方法。通过在TCAM中设计两阶段缓存体系结构，cca可以有效地处理规则依赖性，并提供显著的性能改进。仿真和现实世界的实验结果表明，与最先进的方案相比，CuCa将平均TCAM命中率提高了38.7%，与商用OpenFlow交换机的默认缓存算法相比，提高了33%以上。

{"title":"A Tale of Two (Flow) Tables: Demystifying Rule Caching in OpenFlow Switches","authors":"Rui Li, Yu Pang, Jin Zhao, Xin Wang","doi":"10.1145/3337821.3337896","DOIUrl":"https://doi.org/10.1145/3337821.3337896","url":null,"abstract":"Software Defined Networking (SDN) enables flexible flow control by deploying fine-grained rules in OpenFlow switches. Modern commodity switches usually use TCAM to store these rules and perform high-speed parallel lookups. Though efficient, the TCAM capacity is limited because TCAM is expensive in cost and power-hungry. The explosive growth in the number of rules has exacerbated the limitation of TCAM. There have been considerable efforts in implementing hybrid flow tables with both TCAM and RAM, where the high-speed TCAM is regarded as a cache to store the most popular rules and the cheap RAM is used to handle cache miss. The primary challenges for designing hybrid TCAM/RAM flow tables lie in how to improve cache hit rate and how to handle wildcard rule dependency when allocating rules between TCAM and RAM. In this paper, we present the design and evaluation of CuCa, a practical and efficient rule caching scheme for hybrid switches. Different from existing schemes, CuCa offers both offline and online algorithms for rule caching, corresponding to the proactive and reactive approaches to OpenFlow rule installation. By designing a two-stage-cache architecture in TCAM, CuCa can handle rule dependency efficiently and provide remarkable performance improvements. Simulation and real-world experiment results reveal that CuCa improves average TCAM hit rate by 38.7% compared to state-of-the-art schemes and by over 33% compared to the default caching algorithm of a commodity OpenFlow switch.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123893794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

Exploiting Vector Processing in Dynamic Binary Translation

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337844

Chih-Min Lin, Sheng-Yu Fu, Ding-Yong Hong, Yu-Ping Liu, Jan-Jan Wu, W. Hsu

Auto vectorization techniques have been adopted by compilers to exploit data-level parallelism in parallel processing for decades. However, since processor architectures have kept enhancing with new features to improve vector/SIMD performance, legacy application binaries failed to fully exploit new vector/SIMD capabilities in modern architectures. For example, legacy ARMv7 binaries cannot benefit from ARMv8 SIMD double precision capability, and legacy x86 binaries cannot enjoy the power of AVX-512 extensions. In this paper, we study the fundamental issues involved in cross-ISA Dynamic Binary Translation (DBT) to convert non-vectorized loops to vector/SIMD forms to achieve greater computation throughput available in newer processor architectures. The key idea is to recover critical loop information from those application binaries in order to carry out vectorization at runtime. Experiment results show that our approach achieves an average speedup of 1.42x compared to ARMv7 native run across various benchmarks in an ARMv7-to-ARMv8 dynamic binary translation system.

自动向量化技术已经被编译器采用了几十年，以利用并行处理中的数据级并行性。然而，由于处理器体系结构不断增强新特性以提高矢量/SIMD性能，遗留应用程序二进制文件无法在现代体系结构中充分利用新的矢量/SIMD功能。例如，遗留的ARMv7二进制文件不能受益于ARMv8 SIMD双精度功能，遗留的x86二进制文件不能享受AVX-512扩展的强大功能。在本文中，我们研究了跨isa动态二进制转换(DBT)所涉及的基本问题，将非矢量化循环转换为矢量/SIMD形式，以在较新的处理器架构中实现更高的计算吞吐量。其关键思想是从这些应用程序二进制文件中恢复关键循环信息，以便在运行时执行向量化。实验结果表明，在ARMv7到armv8动态二进制转换系统中，与ARMv7本机运行的各种基准测试相比，我们的方法实现了1.42倍的平均加速。

引用次数: 0

A Plugin Architecture for the TAU Performance System TAU性能系统的插件架构

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337916

A. Malony, Srinivasan Ramesh, K. Huck, Nicholas Chaimov, S. Shende

Several robust performance systems have been created for parallel machines with the ability to observe diverse aspects of application execution on different hardware platforms. All of these are designed with the objective to support measurement methods that are efficient, portable, and scalable. For these reasons, the performance measurement infrastructure is tightly embedded with the application code and runtime execution environment. As parallel software and systems evolve, especially towards more heterogeneous, asynchronous, and dynamic operation, it is expected that the requirements for performance observation and awareness will change. For instance, heterogeneous machines introduce new types of performance data to capture and performance behaviors to characterize. Furthermore, there is a growing interest in interacting with the performance infrastructure for in situ analytics and policy-based control. The problem is that an existing performance system architecture could be constrained in its ability to evolve to meet these new requirements. The paper reports our research efforts to address this concern in the context of the TAU Performance System. In particular, we consider the use of a powerful plugin model to both capture existing capabilities in TAU and to extend its functionality in ways it was not necessarily conceived originally. The TAU plugin architecture supports three types of plugin paradigms: EVENT, TRIGGER, and AGENT. We demonstrate how each operates under several different scenarios. Results from larger-scale experiments are shown to highlight the fact that efficiency and robustness can be maintained, while new flexibility and programmability can be offered that leverages the power of the core TAU system while allowing significant and compelling extensions to be realized.

已经为并行机器创建了几个健壮的性能系统，这些系统能够观察不同硬件平台上应用程序执行的不同方面。所有这些都是为了支持高效、便携和可扩展的测量方法而设计的。由于这些原因，性能度量基础结构与应用程序代码和运行时执行环境紧密嵌入。随着并行软件和系统的发展，特别是朝着异构、异步和动态操作的方向发展，预计对性能观察和感知的需求将发生变化。例如，异构机器引入了要捕获的新类型的性能数据和要描述的性能行为。此外，人们对与性能基础设施进行交互以实现原位分析和基于策略的控制越来越感兴趣。问题是，现有的性能系统架构在满足这些新需求的能力方面可能受到限制。本文报告了我们在TAU绩效系统背景下解决这一问题的研究努力。特别地，我们考虑使用一个强大的插件模型来捕获TAU中的现有功能，并以一种不一定是最初设想的方式扩展其功能。TAU插件架构支持三种类型的插件范例:EVENT、TRIGGER和AGENT。我们将演示它们在几种不同场景下的操作方式。大规模实验的结果表明，可以保持效率和健壮性，同时可以提供新的灵活性和可编程性，利用核心TAU系统的功能，同时允许实现重要和引人注目的扩展。

{"title":"A Plugin Architecture for the TAU Performance System","authors":"A. Malony, Srinivasan Ramesh, K. Huck, Nicholas Chaimov, S. Shende","doi":"10.1145/3337821.3337916","DOIUrl":"https://doi.org/10.1145/3337821.3337916","url":null,"abstract":"Several robust performance systems have been created for parallel machines with the ability to observe diverse aspects of application execution on different hardware platforms. All of these are designed with the objective to support measurement methods that are efficient, portable, and scalable. For these reasons, the performance measurement infrastructure is tightly embedded with the application code and runtime execution environment. As parallel software and systems evolve, especially towards more heterogeneous, asynchronous, and dynamic operation, it is expected that the requirements for performance observation and awareness will change. For instance, heterogeneous machines introduce new types of performance data to capture and performance behaviors to characterize. Furthermore, there is a growing interest in interacting with the performance infrastructure for in situ analytics and policy-based control. The problem is that an existing performance system architecture could be constrained in its ability to evolve to meet these new requirements. The paper reports our research efforts to address this concern in the context of the TAU Performance System. In particular, we consider the use of a powerful plugin model to both capture existing capabilities in TAU and to extend its functionality in ways it was not necessarily conceived originally. The TAU plugin architecture supports three types of plugin paradigms: EVENT, TRIGGER, and AGENT. We demonstrate how each operates under several different scenarios. Results from larger-scale experiments are shown to highlight the fact that efficiency and robustness can be maintained, while new flexibility and programmability can be offered that leverages the power of the core TAU system while allowing significant and compelling extensions to be realized.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117283700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Parallel Algorithms for Evaluating Matrix Polynomials 求矩阵多项式的并行算法

Proceedings of the 48th International Conference on Parallel Processing

Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337871

Sivan Toledo, Amit Waisel

We develop and evaluate parallel algorithms for a fundamental problem in numerical computing, namely the evaluation of a polynomial of a matrix. The algorithm consists of many building blocks that can be assembled in several ways. We investigate parallelism in individual building blocks, develop parallel implemenations, and assemble them into an overall parallel algorithm. We analyze the effects of both the dimension of the matrix and the degree of the polynomial on both arithmetic complexity and on parallelism, and we consequently propose which variants use in different cases. Our theoretical results indicate that one variant of the algorithm, based on applying the Paterson-Stockmeyer method to the entire matrix, parallelizes very effectively on virtually any matrix dimension and polynomial degree. However, it is not the most efficient from the arithmetic complexity viewpoint. Another algorithm, based on the Davies-Higham block recurrence is much more efficient from the arithmetic complexity viewpoint, but one of its building blocks is serial. Experimental results on a dual-socket 28-core server show that the first algorithm can effectively use all the cores, but that on high-degree polynomials the second algorithm is often faster, in spite of the sequential phase. This indicates that our parallel algorithms for the other phases are indeed effective.

我们开发和评估并行算法在数值计算的一个基本问题，即评估一个矩阵的多项式。该算法由许多构建块组成，这些构建块可以用几种方式组装。我们研究单个构建块中的并行性，开发并行实现，并将它们组装成一个整体并行算法。我们分析了矩阵的维数和多项式的度对算术复杂度和并行性的影响，并由此提出了在不同情况下使用的变体。我们的理论结果表明，该算法的一种变体，基于将patterson - stockmeyer方法应用于整个矩阵，在几乎任何矩阵维数和多项式次上都非常有效地并行化。然而，从算法复杂度的角度来看，它并不是最有效的。另一种基于davis - higham块递归的算法从算法复杂度的角度来看效率要高得多，但其构建模块之一是序列。在双插槽28核服务器上的实验结果表明，第一种算法可以有效地使用所有内核，但在高次多项式上，第二种算法通常更快，尽管有顺序阶段。这表明我们的并行算法在其他阶段确实是有效的。

{"title":"Parallel Algorithms for Evaluating Matrix Polynomials","authors":"Sivan Toledo, Amit Waisel","doi":"10.1145/3337821.3337871","DOIUrl":"https://doi.org/10.1145/3337821.3337871","url":null,"abstract":"We develop and evaluate parallel algorithms for a fundamental problem in numerical computing, namely the evaluation of a polynomial of a matrix. The algorithm consists of many building blocks that can be assembled in several ways. We investigate parallelism in individual building blocks, develop parallel implemenations, and assemble them into an overall parallel algorithm. We analyze the effects of both the dimension of the matrix and the degree of the polynomial on both arithmetic complexity and on parallelism, and we consequently propose which variants use in different cases. Our theoretical results indicate that one variant of the algorithm, based on applying the Paterson-Stockmeyer method to the entire matrix, parallelizes very effectively on virtually any matrix dimension and polynomial degree. However, it is not the most efficient from the arithmetic complexity viewpoint. Another algorithm, based on the Davies-Higham block recurrence is much more efficient from the arithmetic complexity viewpoint, but one of its building blocks is serial. Experimental results on a dual-socket 28-core server show that the first algorithm can effectively use all the cores, but that on high-degree polynomials the second algorithm is often faster, in spite of the sequential phase. This indicates that our parallel algorithms for the other phases are indeed effective.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117324691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0