
Proceedings of the 48th International Conference on Parallel Processing: Latest Papers

A Tale of Two (Flow) Tables: Demystifying Rule Caching in OpenFlow Switches
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337896
Rui Li, Yu Pang, Jin Zhao, Xin Wang
Software Defined Networking (SDN) enables flexible flow control by deploying fine-grained rules in OpenFlow switches. Modern commodity switches usually use TCAM to store these rules and perform high-speed parallel lookups. Though efficient, TCAM capacity is limited because TCAM is expensive and power-hungry. The explosive growth in the number of rules has exacerbated this limitation. There have been considerable efforts to implement hybrid flow tables with both TCAM and RAM, where the high-speed TCAM serves as a cache for the most popular rules and the cheap RAM handles cache misses. The primary challenges in designing hybrid TCAM/RAM flow tables are how to improve the cache hit rate and how to handle wildcard rule dependency when allocating rules between TCAM and RAM. In this paper, we present the design and evaluation of CuCa, a practical and efficient rule caching scheme for hybrid switches. Unlike existing schemes, CuCa offers both offline and online algorithms for rule caching, corresponding to the proactive and reactive approaches to OpenFlow rule installation. By designing a two-stage cache architecture in TCAM, CuCa can handle rule dependency efficiently and provide remarkable performance improvements. Simulation and real-world experiment results reveal that CuCa improves the average TCAM hit rate by 38.7% compared to state-of-the-art schemes and by over 33% compared to the default caching algorithm of a commodity OpenFlow switch.
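The wildcard rule dependency problem named in this abstract can be illustrated with a small sketch. This is not CuCa's algorithm, which the abstract does not describe in detail. It only shows why a low-priority wildcard rule cannot be cached on its own: any higher-priority rule that overlaps it must be cached as well, otherwise the cache could return the wrong match. The rule format (ternary strings with `*`), the type alias `Rule`, and the function names `overlaps` and `cache_closure` are assumptions made for illustration, and the closure computed here is a conservative one.

```python
from typing import List, Set, Tuple

# Hypothetical rule format: (ternary match pattern, priority).
Rule = Tuple[str, int]


def overlaps(a: str, b: str) -> bool:
    """Two ternary patterns overlap if no position has conflicting fixed bits."""
    return all(x == y or x == "*" or y == "*" for x, y in zip(a, b))


def cache_closure(rules: List[Rule], target: int) -> Set[int]:
    """Indices of rules that must be cached together with rules[target].

    Conservative sketch: every higher-priority rule that overlaps an
    already-selected rule is pulled in, transitively.
    """
    needed = {target}
    stack = [target]
    while stack:
        i = stack.pop()
        pattern_i, priority_i = rules[i]
        for j, (pattern_j, priority_j) in enumerate(rules):
            if j not in needed and priority_j > priority_i and overlaps(pattern_i, pattern_j):
                needed.add(j)
                stack.append(j)
    return needed


rules = [("1***", 1), ("10**", 2), ("101*", 3), ("0***", 1)]
print(cache_closure(rules, 0))  # the broad rule drags in both more specific rules
print(cache_closure(rules, 3))  # an independent rule can be cached alone
```

In this toy table, caching the broad rule `1***` costs three TCAM entries while `0***` costs one, which is the kind of imbalance a dependency-aware caching scheme has to account for.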
Citations: 17
Efficient Data-Parallel Primitives on Heterogeneous Systems
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337920
Zhuohang Lai, Qiong Luo, Xiaolong Xie
Data-parallel primitives, such as gather, scatter, scan, and split, are widely used in data-intensive applications. However, it is challenging to optimize them on a system consisting of heterogeneous processors. In this paper, we study and compare the existing implementations and optimization strategies for a set of data-parallel primitives on three processors: GPU, CPU and Xeon Phi co-processor. Our goal is to identify the key performance factors in the implementations of data-parallel primitive operations on different architectures and to develop general strategies for implementing these primitives efficiently on various platforms. We introduce a portable and efficient sequential memory access pattern, which eliminates the cost of adjusting the memory access pattern for each individual device. With proper tuning, our optimized primitive implementations can achieve performance comparable to the native versions. Moreover, our profiling results show that the CPU and the Phi co-processor share most optimization strategies whereas the GPU differs from them significantly, due to hardware differences among these devices such as the efficiency of vectorization, data and TLB caching, and data prefetching. We summarize these factors and deliver common primitive optimization strategies for heterogeneous systems.
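For readers unfamiliar with the four primitives named in this abstract, the following sequential reference sketch shows their usual semantics. It says nothing about the optimized GPU, CPU, or Xeon Phi implementations the paper studies; the function names and signatures are assumptions chosen for illustration, and `split` is written here as a stable partition by bucket key built from a scan and a scatter.

```python
from itertools import accumulate
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def gather(src: Sequence[T], idx: Sequence[int]) -> List[T]:
    """out[i] = src[idx[i]]: indexed reads, sequential writes."""
    return [src[i] for i in idx]


def scatter(src: Sequence[T], idx: Sequence[int], size: int) -> List[Optional[T]]:
    """out[idx[i]] = src[i]: sequential reads, indexed writes."""
    out: List[Optional[T]] = [None] * size
    for value, i in zip(src, idx):
        out[i] = value
    return out


def exclusive_scan(values: Sequence[int]) -> List[int]:
    """Exclusive prefix sum: out[i] = sum(values[:i])."""
    return list(accumulate(values, initial=0))[:-1]


def split(values: Sequence[T], keys: Sequence[int], num_buckets: int) -> List[Optional[T]]:
    """Stable partition of values into buckets 0..num_buckets-1 by key."""
    counts = [0] * num_buckets
    for k in keys:
        counts[k] += 1
    offsets = exclusive_scan(counts)  # start position of each bucket
    positions = []
    for k in keys:
        positions.append(offsets[k])
        offsets[k] += 1
    return scatter(values, positions, len(values))


print(gather([10, 20, 30], [2, 0, 1]))
print(scatter([10, 20, 30], [2, 0, 1], 3))
print(exclusive_scan([3, 1, 2]))
print(split(["a", "b", "c", "d"], [1, 0, 1, 0], 2))
```

The sketch also shows why these primitives are usually studied together: `split` is composed from a histogram, a scan, and a scatter, so the memory access behavior of the simpler primitives carries over to the composite one.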
Citations: 2
EMBA
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337863
Yaocheng Xiang, Chencheng Ye, Xiaolin Wang, Yingwei Luo, Zhenlin Wang
EMBA 604 STRATEGIC ANALYSIS. (2) This course provides a framework of competitive analysis and competitive advantage upon which functionally oriented courses in the program may build. It provides an overall picture of the analysis activities and decision-making situations facing a company’s top management team (i.e., CEOs, general managers, division managers) focusing on top management decisions relating to the external environment and internal issues. It presents practical experience in recognizing what information is important, sifting it for relevance, and employing the knowledge for the competitive benefit of the firm. Prereq: Admission to the joint EMBA program.
Citations: 22
JobPacker
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337880
Zhuozhao Li, Haiying Shen
Despite the many advantages of hybrid electrical/optical datacenter networks (Hybrid-DCN), current job schedulers for data-parallel frameworks are not suitable for Hybrid-DCN, since they do not aggregate data traffic to facilitate the use of optical circuit switches (OCS). In this paper, we propose JobPacker, a job scheduler for data-parallel frameworks in Hybrid-DCN that aims to take full advantage of OCS to improve job performance. JobPacker aggregates the data transfers of a job in order to use OCS to improve data transfer efficiency. It first explores the tradeoff between parallelism and traffic aggregation for each shuffle-heavy recurring job, and then generates an offline schedule that yields the best performance, specifying which racks run each job and the order in which the recurring jobs run in each rack. It uses a new sorting method to prioritize recurring jobs in offline scheduling, preventing high resource contention while fully utilizing cluster resources. In its real-time scheduler, JobPacker uses the offline schedule to guide data placement and to schedule recurring jobs, and it schedules non-recurring jobs onto idle resources not assigned to recurring jobs. Trace-driven simulation and GENI-based emulation show that JobPacker reduces the makespan by up to 49% and the median completion time by up to 43%, compared to state-of-the-art schedulers in Hybrid-DCN.
Citations: 1
Cosin
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337858
Jingya Zhou, Jianxi Fan, Jin Wang
Influence Maximization (IM) has been extensively applied to many fields, and viral marketing in today's online social networks (OSNs) is one of its best-known applications, where a group of seed users is selected to activate more users in a distributed cascading fashion. Many prior works explore the IM problem under the assumption of a given budget. However, the budget assumption does not hold in many practical scenarios, since companies might lack sufficient prior knowledge about the market. Moreover, companies prefer moderately controllable viral marketing that allows them to adjust marketing decisions according to the market reaction. In this paper, we propose a new problem, called Controllable social influence maximization (Cosin), to find a set of seed users inside a controllable scope that maximizes the benefit given an expected return on investment (ROI). Like the IM problem, the Cosin problem is NP-hard. We present a distributed multi-hop based framework for influence estimation, and design a (1/2 + ϵ)-approximate algorithm based on the proposed framework. We further present a distributed implementation to accelerate the execution of the algorithm on large-scale social networks. Extensive experiments with a billion-scale social network indicate that the proposed algorithms outperform state-of-the-art algorithms in both benefit and running time.
Citations: 1
Improved Unconstrained Energy Functional Method for Eigensolvers in Electronic Structure Calculations
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337914
M. D. Ben, O. Marques, A. Canning
This paper reports on the performance of a preconditioned conjugate gradient based iterative eigensolver using an unconstrained energy functional minimization scheme. In contrast to standard implementations, this scheme avoids an explicit reorthogonalization of the trial eigenvectors and becomes an attractive alternative for the solution of very large problems. The unconstrained formulation is implemented in the first-principles materials and chemistry CP2K code, which performs electronic structure calculations based on a density functional theory approximation to the solution of the many-body Schrödinger equation. We study the convergence of the unconstrained formulation, as well as its parallel scaling, on a Cray XC40 at the National Energy Research Scientific Computing Center (NERSC). The systems we use in our studies are bulk liquid water, a supramolecular catalyst gold(III)-complex, a bilayer of MoS2-WSe2 and a divacancy point defect in silicon, with the number of atoms ranging from 2,247 to 12,288. We show that the unconstrained formulation with an appropriate preconditioner has good convergence properties and scales well to 230k cores, roughly 38% of the full machine.
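The abstract does not state the functional itself. For orientation only, one form that appears in the literature on unconstrained (orthogonality-free) minimization is shown below. The symbols are notation chosen here rather than taken from the paper: $A$ is the symmetric matrix whose lowest eigenpairs are sought, $X$ is the block of trial vectors, and $S$ and $H$ are the overlap and projected matrices. The improved functional studied in the paper may differ from this form.

```latex
\begin{align}
S &= X^{\mathsf{T}} X, \qquad H = X^{\mathsf{T}} A X, \\
E(X) &= \operatorname{Tr}\!\left[(2I - S)\,H\right].
\end{align}
```

When the columns of $X$ are orthonormal, $S = I$ and $E(X)$ reduces to the familiar constrained trace $\operatorname{Tr}[X^{\mathsf{T}} A X]$. Minimizing $E(X)$ without imposing $S = I$ is what removes the explicit reorthogonalization step; in this form the spectrum of $A$ is typically shifted so that the eigenvalues of interest are negative.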
Citations: 1
OSP: Overlapping Computation and Communication in Parameter Server for Fast Machine Learning
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337828
Haozhao Wang, Song Guo, Ruixuan Li
When running on a Parameter Server (PS), distributed Stochastic Gradient Descent (SGD) incurs significant communication delays because, after pushing their updates, computing nodes (workers) have to wait in every iteration for the global model to be communicated back from the master. In this paper, we devise a new synchronization parallel mechanism named overlap synchronization parallel (OSP), in which the waiting time is removed by conducting computation and communication in an overlapped manner. We theoretically prove that our mechanism achieves the same convergence rate as sequential SGD for non-convex problems. Evaluations show that our mechanism significantly improves performance over state-of-the-art ones, e.g., by 4× for both AlexNet and ResNet18 in terms of convergence speed.
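The benefit of overlapping can be seen with a back-of-the-envelope timing model. The sketch below is an idealized two-stage pipeline, not the OSP mechanism itself: it assumes a fixed compute time and a fixed communication time per iteration, and the function names are hypothetical.

```python
def sequential_time(compute: int, comm: int, iterations: int) -> int:
    """Each iteration computes, then waits for communication to finish."""
    return iterations * (compute + comm)


def overlapped_time(compute: int, comm: int, iterations: int) -> int:
    """Idealized pipeline: communication of one iteration overlaps with
    computation of the next, so the slower stage sets the steady-state pace."""
    if iterations == 0:
        return 0
    return compute + (iterations - 1) * max(compute, comm) + comm


# Example with arbitrary time units: 2 units of compute, 3 of communication.
print(sequential_time(2, 3, 10))
print(overlapped_time(2, 3, 10))
```

Under this model the per-iteration cost drops from `compute + comm` to roughly `max(compute, comm)`, so the gain is largest when the two stages take similar time.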
Citations: 15
AdaM
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337822
Shiyi Cao, Yuanning Gao, Xiaofeng Gao, Guihai Chen
Distributed metadata management, which governs how metadata nodes are distributed across metadata servers (MDSs), can substantially improve the overall performance of large-scale distributed storage systems if well designed. A major difficulty confronting many metadata management schemes is the trade-off between two conflicting goals: system load balance and metadata locality preservation. The problem becomes even more challenging because file access patterns inevitably vary with time. Existing works dynamically reallocate nodes to different servers using history-based, coarse-grained methods, and fail to update the distribution of nodes in a timely and efficient manner. In this paper, we propose an adaptive fine-grained metadata management scheme, AdaM, which leverages Deep Reinforcement Learning to address the trade-off under time-varying access patterns. At each time step, AdaM collects environmental "states" including the access pattern, the structure of the namespace tree, and the current distribution of nodes on the MDSs. An actor-critic network is then trained to reallocate hot metadata nodes to different servers according to the observed "states". Adapting to varying access patterns, AdaM can automatically migrate hot metadata nodes among servers to maintain load balance while preserving metadata locality. We test AdaM on real-world data traces. Experimental results demonstrate the superiority of our proposed method over other schemes.
Citations: 2
Machine Learning for Fine-Grained Hardware Prefetcher Control
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337854
Jason Hiebel, Laura E. Brown, Zhenlin Wang
Modern architectures provide hardware memory prefetching capabilities which can be configured at runtime. While hardware prefetching can provide substantial performance improvements for many programs, prefetching can also increase contention for shared resources such as last-level cache and memory bandwidth. In turn, this contention can degrade performance in multi-core workloads. In this paper, we model fine-grained hardware prefetcher control as a contextual bandit, and propose a framework for learning prefetcher control policies which adjust hardware prefetching usage at runtime according to workload performance behavior. We train our policies on profiling data, wherein hardware memory prefetchers are enabled or disabled randomly at regular intervals over the course of a workload's execution. The learned prefetcher control policies provide up to a 4.3% average performance improvement over a set of memory bandwidth intensive workloads.
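The idea of learning a control policy from randomly toggled profiling runs can be sketched in a few lines. This is a deliberately simple tabular stand-in, not the paper's framework: the context labels, the reward (for example, instructions per cycle in an interval), the sample format, and the function name `learn_policy` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical log entry: (observed context, action taken, measured reward).
# Action 1 means prefetcher enabled for the interval, 0 means disabled.
Sample = Tuple[Hashable, int, float]


def learn_policy(log: Iterable[Sample]) -> Dict[Hashable, int]:
    """For each context, pick the action with the highest mean logged reward."""
    totals: Dict[Tuple[Hashable, int], List[float]] = defaultdict(lambda: [0.0, 0.0])
    for context, action, reward in log:
        entry = totals[(context, action)]
        entry[0] += reward  # running reward sum
        entry[1] += 1       # sample count
    best: Dict[Hashable, Tuple[int, float]] = {}
    for (context, action), (reward_sum, count) in totals.items():
        mean = reward_sum / count
        if context not in best or mean > best[context][1]:
            best[context] = (action, mean)
    return {context: action for context, (action, _) in best.items()}


log = [
    ("low_bandwidth", 1, 1.2),
    ("low_bandwidth", 0, 1.0),
    ("high_bandwidth", 1, 0.7),
    ("high_bandwidth", 0, 0.9),
    ("high_bandwidth", 1, 0.8),
]
policy = learn_policy(log)
print(policy)
```

Because the logged actions were chosen at random rather than by an existing policy, the per-action averages in each context are not skewed by how the data was collected, which is what makes this kind of profiling data usable for training.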
Citations: 7
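The abstract above frames prefetcher control as a contextual bandit trained on profiling data in which the prefetchers were toggled at random. The following is only a minimal sketch of that general idea, not the authors' framework: it assumes a discretized context label, two actions (`"on"` and `"off"`), and a scalar reward such as relative throughput. All of these names and the sample numbers are hypothetical.

```python
from collections import defaultdict


def learn_policy(logs):
    """Pick, per context, the action with the highest mean logged reward.

    `logs` is an iterable of (context, action, reward) tuples. The sketch
    assumes actions were chosen uniformly at random while logging, so the
    per-action means are comparable without reweighting.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (context, action) -> [sum, count]
    for context, action, reward in logs:
        entry = totals[(context, action)]
        entry[0] += reward
        entry[1] += 1

    best = {}  # context -> (action, mean reward)
    for (context, action), (total, count) in totals.items():
        mean = total / count
        if context not in best or mean > best[context][1]:
            best[context] = (action, mean)
    return {context: action for context, (action, _) in best.items()}


# Hypothetical profiling log: (workload phase, prefetcher setting, reward).
logs = [
    ("bandwidth_bound", "off", 1.10),
    ("bandwidth_bound", "on", 0.95),
    ("bandwidth_bound", "off", 1.06),
    ("latency_bound", "on", 1.20),
    ("latency_bound", "off", 0.90),
]
policy = learn_policy(logs)
print(policy)
```

A tabular policy like this only works for a handful of discrete contexts. A policy over continuous performance-counter features would need a learned model instead, which this sketch does not attempt.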
A Parallel Graph Algorithm for Detecting Mesh Singularities in Distributed Memory Ice Sheet Simulations
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337841
Ian Bogle, K. Devine, M. Perego, S. Rajamanickam, George M. Slota
We present a new distributed-memory parallel algorithm for detecting degenerate mesh features that can cause singularities in ice sheet mesh simulations. Identifying and removing mesh features such as disconnected components (icebergs) or hinge vertices (peninsulas of ice detached from the land) can significantly improve the convergence of iterative solvers. Because the ice sheet evolves during the course of a simulation, it is important that the detection algorithm can run in situ with the simulation, in parallel and in a negligible amount of computation time, so that degenerate features (e.g., calving icebergs) can be detected as they develop. We present a distributed-memory, BFS-based label-propagation approach to degenerate feature detection that is efficient enough to be called at each step of an ice sheet simulation while correctly identifying all degenerate features of an ice sheet mesh. Our method finds all degenerate features in a mesh with 13 million vertices in 0.0561 seconds on 1536 cores in the MPAS Albany Land Ice (MALI) model. Compared to the previously used serial pre-processing approach, we observe a 46,000x speedup for our algorithm and provide the additional capability of dynamically detecting degenerate features during the simulation.
Citations: 3
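The BFS-based labeling described in the abstract above can be illustrated with a small serial sketch. This is not the authors' distributed-memory algorithm and it does not cover hinge vertices. It assumes the mesh is given as an adjacency dictionary and that a set of "grounded" vertices is known (both hypothetical inputs), and it flags every vertex that a multi-source BFS from the grounded set cannot reach, which corresponds to the disconnected "iceberg" case.

```python
from collections import deque


def find_detached_vertices(adjacency, grounded):
    """Return the vertices not connected to any grounded vertex.

    `adjacency` maps each vertex to a list of its neighbours; `grounded`
    is an iterable of seed vertices (a hypothetical input for this sketch).
    """
    reached = set(grounded)
    frontier = deque(reached)
    while frontier:
        vertex = frontier.popleft()
        for neighbour in adjacency[vertex]:
            if neighbour not in reached:
                reached.add(neighbour)  # propagate the "attached" label
                frontier.append(neighbour)
    return set(adjacency) - reached


# Toy mesh: vertices 0-1-2 form a chain touching land at 0,
# while vertices 3-4 form a separate floating component.
mesh = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
detached = find_detached_vertices(mesh, grounded=[0])
print(sorted(detached))
```

In a distributed setting each rank would own only part of the adjacency structure and exchange labels for boundary vertices between BFS rounds. That communication step is omitted here.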
Journal: Proceedings of the 48th International Conference on Parallel Processing