Software Defined Networking (SDN) enables flexible flow control by deploying fine-grained rules in OpenFlow switches. Modern commodity switches usually use TCAM to store these rules and perform high-speed parallel lookups. Though efficient, TCAM is limited in capacity because it is expensive and power-hungry. The explosive growth in the number of rules has exacerbated this limitation. There have been considerable efforts to implement hybrid flow tables with both TCAM and RAM, where the high-speed TCAM serves as a cache for the most popular rules and the cheap RAM handles cache misses. The primary challenges in designing hybrid TCAM/RAM flow tables are improving the cache hit rate and handling wildcard rule dependency when allocating rules between TCAM and RAM. In this paper, we present the design and evaluation of CuCa, a practical and efficient rule caching scheme for hybrid switches. Unlike existing schemes, CuCa offers both offline and online algorithms for rule caching, corresponding to the proactive and reactive approaches to OpenFlow rule installation. By designing a two-stage cache architecture in TCAM, CuCa can handle rule dependency efficiently and deliver significant performance improvements. Simulation and real-world experiment results reveal that CuCa improves average TCAM hit rate by 38.7% compared to state-of-the-art schemes and by over 33% compared to the default caching algorithm of a commodity OpenFlow switch.
{"title":"A Tale of Two (Flow) Tables: Demystifying Rule Caching in OpenFlow Switches","authors":"Rui Li, Yu Pang, Jin Zhao, Xin Wang","doi":"10.1145/3337821.3337896","DOIUrl":"https://doi.org/10.1145/3337821.3337896","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123893794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
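As a rough illustration of the wildcard rule dependency problem named in the CuCa abstract above, the sketch below computes which rules must accompany a given rule in the TCAM so that a cache hit never returns a lower-priority match than the full table would. This is a generic dependency closure, not CuCa's two-stage design; the rule names, the `(priority, pattern)` representation, and the function names are assumptions made for the example.

```python
from typing import Dict, Set, Tuple

# Hypothetical rule format: (priority, ternary pattern), higher priority wins.
Rule = Tuple[int, str]


def overlaps(p: str, q: str) -> bool:
    """Two ternary patterns overlap if some packet can match both."""
    return all(a == b or "*" in (a, b) for a, b in zip(p, q))


def cache_closure(rules: Dict[str, Rule], target: str) -> Set[str]:
    """Return `target` plus every higher-priority rule it transitively
    depends on. Caching the whole set keeps TCAM lookups consistent
    with the full rule table."""
    needed = {target}
    stack = [target]
    while stack:
        prio, pattern = rules[stack.pop()]
        for name, (other_prio, other_pattern) in rules.items():
            if (name not in needed and other_prio > prio
                    and overlaps(pattern, other_pattern)):
                needed.add(name)
                stack.append(name)
    return needed


rules = {
    "r1": (3, "10*"),
    "r2": (2, "1**"),
    "r3": (1, "***"),
    "r4": (3, "01*"),
}
print(sorted(cache_closure(rules, "r2")))  # r2 drags r1 into the cache
```

The closure grows quickly for broad wildcard rules (caching `r3` here pulls in every other rule), which is one way to see why dependency handling matters for small TCAMs.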
Data-parallel primitives, such as gather, scatter, scan, and split, are widely used in data-intensive applications. However, it is challenging to optimize them on a system consisting of heterogeneous processors. In this paper, we study and compare the existing implementations and optimization strategies for a set of data-parallel primitives on three processors: GPU, CPU, and Xeon Phi co-processor. Our goal is to identify the key performance factors in the implementations of data-parallel primitive operations on different architectures and to develop general strategies for implementing these primitives efficiently on various platforms. We introduce a portable and efficient sequential memory access pattern, which eliminates the cost of adjusting the memory access pattern for each individual device. With proper tuning, our optimized primitive implementations can achieve performance comparable to the native versions. Moreover, our profiling results show that the CPU and the Phi co-processor share most optimization strategies whereas the GPU differs from them significantly, due to hardware differences among these devices, such as the efficiency of vectorization, data and TLB caching, and data prefetching. We summarize these factors and deliver common primitive optimization strategies for heterogeneous systems.
{"title":"Efficient Data-Parallel Primitives on Heterogeneous Systems","authors":"Zhuohang Lai, Qiong Luo, Xiaolong Xie","doi":"10.1145/3337821.3337920","DOIUrl":"https://doi.org/10.1145/3337821.3337920","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125332264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
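For readers unfamiliar with the four primitives named in the abstract above, the following serial reference sketch shows what each one computes. It says nothing about the paper's optimized GPU, CPU, or Xeon Phi implementations; the function names and the 0/1 flag convention for split are assumptions chosen for the example.

```python
from itertools import accumulate


def gather(src, idx):
    """out[i] = src[idx[i]]: indexed reads, sequential writes."""
    return [src[i] for i in idx]


def scatter(src, idx, size):
    """out[idx[i]] = src[i]: sequential reads, indexed writes."""
    out = [None] * size
    for value, i in zip(src, idx):
        out[i] = value
    return out


def exclusive_scan(xs):
    """Prefix sums where out[i] is the sum of xs[0..i-1]."""
    return list(accumulate(xs, initial=0))[:-1]


def split(xs, flags):
    """Stable partition: elements with flag 0 first, then flag 1.
    Destinations come from two scans, then one scatter."""
    zero_pos = exclusive_scan([1 - f for f in flags])
    one_pos = exclusive_scan(flags)
    n_zero = len(flags) - sum(flags)
    dest = [n_zero + one_pos[i] if f else zero_pos[i]
            for i, f in enumerate(flags)]
    return scatter(xs, dest, len(xs))


print(split([5, 6, 7, 8], [1, 0, 1, 0]))
```

Building split out of scan and scatter is a common construction, and it shows why the memory access pattern of gather and scatter tends to dominate the cost of the higher-level primitives.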
Yaocheng Xiang, Chencheng Ye, Xiaolin Wang, Yingwei Luo, Zhenlin Wang
EMBA 604 STRATEGIC ANALYSIS. (2) This course provides a framework of competitive analysis and competitive advantage upon which functionally oriented courses in the program may build. It provides an overall picture of the analysis activities and decision-making situations facing a company’s top management team (i.e., CEOs, general managers, division managers) focusing on top management decisions relating to the external environment and internal issues. It presents practical experience in recognizing what information is important, sifting it for relevance, and employing the knowledge for the competitive benefit of the firm. Prereq: Admission to the joint EMBA program.
{"title":"EMBA","authors":"Yaocheng Xiang, Chencheng Ye, Xiaolin Wang, Yingwei Luo, Zhenlin Wang","doi":"10.1145/3337821.3337863","DOIUrl":"https://doi.org/10.1145/3337821.3337863","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124916423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite the many advantages of hybrid electrical/optical datacenter networks (Hybrid-DCN), current job schedulers for data-parallel frameworks are not suitable for Hybrid-DCN, since they do not aggregate data traffic to facilitate the use of optical circuit switches (OCS). In this paper, we propose JobPacker, a job scheduler for data-parallel frameworks in Hybrid-DCN that aims to take full advantage of OCS to improve job performance. JobPacker aggregates the data transfers of a job in order to use OCS to improve data transfer efficiency. It first explores the tradeoff between parallelism and traffic aggregation for each shuffle-heavy recurring job, and then generates the offline schedule that yields the best performance, specifying which racks run each job and the order in which the recurring jobs run in each rack. It uses a new sorting method to prioritize recurring jobs in offline scheduling, preventing high resource contention while fully utilizing cluster resources. In its real-time scheduler, JobPacker uses the offline schedule to guide data placement and to schedule recurring jobs, and it schedules non-recurring jobs on the idle resources not assigned to recurring jobs. Trace-driven simulation and GENI-based emulation show that JobPacker reduces the makespan by up to 49% and the median completion time by up to 43%, compared to the state-of-the-art schedulers in Hybrid-DCN.
{"title":"JobPacker","authors":"Zhuozhao Li, Haiying Shen","doi":"10.1145/3337821.3337880","DOIUrl":"https://doi.org/10.1145/3337821.3337880","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"45 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124960498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Influence Maximization (IM) has been extensively applied in many fields. Viral marketing in today's online social networks (OSNs) is one of its best-known applications, where a group of seed users is selected to activate more users in a distributed cascading fashion. Many prior works explore the IM problem under the assumption of a given budget. However, the budget assumption does not hold in many practical scenarios, since companies might lack sufficient prior knowledge about the market. Moreover, companies prefer a moderately controllable form of viral marketing that allows them to adjust marketing decisions according to the market reaction. In this paper, we propose a new problem, called Controllable social influence maximization (Cosin), to find a set of seed users inside a controllable scope to maximize the benefit given an expected return on investment (ROI). Like the IM problem, the Cosin problem is NP-hard. We present a distributed multi-hop based framework for influence estimation, and design a (1/2 + ϵ)-approximate algorithm based on the proposed framework. We further present a distributed implementation to accelerate the execution of the algorithm on large-scale social networks. Extensive experiments with a billion-scale social network indicate that the proposed algorithms outperform state-of-the-art algorithms in both benefit and running time.
{"title":"Cosin","authors":"Jingya Zhou, Jianxi Fan, Jin Wang","doi":"10.1145/3337821.3337858","DOIUrl":"https://doi.org/10.1145/3337821.3337858","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130358346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper reports on the performance of a preconditioned conjugate gradient based iterative eigensolver using an unconstrained energy functional minimization scheme. In contrast to standard implementations, this scheme avoids an explicit reorthogonalization of the trial eigenvectors and becomes an attractive alternative for the solution of very large problems. The unconstrained formulation is implemented in the first-principles materials and chemistry CP2K code, which performs electronic structure calculations based on a density functional theory approximation to the solution of the many-body Schrödinger equation. We study the convergence of the unconstrained formulation, as well as its parallel scaling, on a Cray XC40 at the National Energy Research Scientific Computing Center (NERSC). The systems we use in our studies are bulk liquid water, a supramolecular catalyst gold(III)-complex, a bilayer of MoS2-WSe2 and a divacancy point defect in silicon, with the number of atoms ranging from 2,247 to 12,288. We show that the unconstrained formulation with an appropriate preconditioner has good convergence properties and scales well to 230k cores, roughly 38% of the full machine.
{"title":"Improved Unconstrained Energy Functional Method for Eigensolvers in Electronic Structure Calculations","authors":"M. D. Ben, O. Marques, A. Canning","doi":"10.1145/3337821.3337914","DOIUrl":"https://doi.org/10.1145/3337821.3337914","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129540441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When running on a Parameter Server (PS) architecture, distributed Stochastic Gradient Descent (SGD) incurs significant communication delays because, after pushing their updates, computing nodes (workers) have to wait for the global model to be communicated back from the master in every iteration. In this paper, we devise a new synchronization parallel mechanism named overlap synchronization parallel (OSP), in which the waiting time is removed by overlapping computation and communication. We theoretically prove that our mechanism can achieve the same convergence rate as sequential SGD for non-convex problems. Evaluations show that our mechanism significantly improves performance over state-of-the-art mechanisms, e.g., by 4× for both AlexNet and ResNet18 in terms of convergence speed.
{"title":"OSP: Overlapping Computation and Communication in Parameter Server for Fast Machine Learning","authors":"Haozhao Wang, Song Guo, Ruixuan Li","doi":"10.1145/3337821.3337828","DOIUrl":"https://doi.org/10.1145/3337821.3337828","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128512209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed metadata management, which governs the distribution of metadata nodes across metadata servers (MDSs), can substantially improve the overall performance of large-scale distributed storage systems if well designed. A major difficulty confronting many metadata management schemes is the trade-off between two conflicting goals: system load balance and metadata locality preservation. The problem becomes even more challenging because file access patterns inevitably vary over time. Existing works dynamically reallocate nodes to different servers using history-based, coarse-grained methods, and fail to update the distribution of nodes in a timely and efficient manner. In this paper, we propose an adaptive fine-grained metadata management scheme, AdaM, which leverages Deep Reinforcement Learning to address the trade-off under time-varying access patterns. At each time step, AdaM collects environmental "states" including the access pattern, the structure of the namespace tree, and the current distribution of nodes on MDSs. An actor-critic network is then trained to reallocate hot metadata nodes to different servers according to the observed "states". Adapting to varying access patterns, AdaM can automatically migrate hot metadata nodes among servers to keep the load balanced while maintaining metadata locality. We test AdaM on real-world data traces. Experimental results demonstrate the superiority of our proposed method over other schemes.
{"title":"AdaM","authors":"Shiyi Cao, Yuanning Gao, Xiaofeng Gao, Guihai Chen","doi":"10.1145/3337821.3337822","DOIUrl":"https://doi.org/10.1145/3337821.3337822","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128190948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern architectures provide hardware memory prefetching capabilities which can be configured at runtime. While hardware prefetching can provide substantial performance improvements for many programs, prefetching can also increase contention for shared resources such as last-level cache and memory bandwidth. In turn, this contention can degrade performance in multi-core workloads. In this paper, we model fine-grained hardware prefetcher control as a contextual bandit, and propose a framework for learning prefetcher control policies which adjust hardware prefetching usage at runtime according to workload performance behavior. We train our policies on profiling data, wherein hardware memory prefetchers are enabled or disabled randomly at regular intervals over the course of a workload's execution. The learned prefetcher control policies provide up to a 4.3% average performance improvement over a set of memory bandwidth intensive workloads.
{"title":"Machine Learning for Fine-Grained Hardware Prefetcher Control","authors":"Jason Hiebel, Laura E. Brown, Zhenlin Wang","doi":"10.1145/3337821.3337854","DOIUrl":"https://doi.org/10.1145/3337821.3337854","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128817438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
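The prefetcher-control abstract above describes training a policy from profiling data in which prefetchers were toggled at random. A minimal sketch of that idea, under strong simplifying assumptions, is a table-based contextual bandit: average the logged reward for each (context, action) pair and pick the best action per context. The discrete context labels, the `"on"`/`"off"` actions, and the reward values below are all hypothetical; the paper's actual features and learner are not given in the abstract.

```python
from collections import defaultdict


def learn_policy(log):
    """Learn a context -> action table from logged (context, action, reward)
    triples collected under uniformly random actions."""
    totals = defaultdict(lambda: [0.0, 0])  # (context, action) -> [sum, count]
    for context, action, reward in log:
        entry = totals[(context, action)]
        entry[0] += reward
        entry[1] += 1

    best = {}  # context -> (action, mean reward)
    for (context, action), (reward_sum, count) in totals.items():
        mean = reward_sum / count
        if context not in best or mean > best[context][1]:
            best[context] = (action, mean)
    return {context: action for context, (action, _) in best.items()}


# Hypothetical log: reward could be, e.g., instructions per cycle
# observed during one profiling interval.
log = [
    ("bw_high", "off", 1.2),
    ("bw_high", "on", 0.9),
    ("bw_high", "off", 1.0),
    ("bw_low", "on", 1.1),
    ("bw_low", "off", 1.0),
]
print(learn_policy(log))
```

Because the logged actions are random rather than chosen by an earlier policy, the per-action averages are unbiased estimates, which is the property that makes the random toggling in the profiling phase useful.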
Ian Bogle, K. Devine, M. Perego, S. Rajamanickam, George M. Slota
We present a new, distributed-memory parallel algorithm for detection of degenerate mesh features that can cause singularities in ice sheet mesh simulations. Identifying and removing mesh features such as disconnected components (icebergs) or hinge vertices (peninsulas of ice detached from the land) can significantly improve the convergence of iterative solvers. Because the ice sheet evolves during the course of a simulation, it is important that the detection algorithm can run in situ with the simulation, in parallel and with a negligible amount of computation time, so that degenerate features (e.g., calving icebergs) can be detected as they develop. We present a distributed-memory, BFS-based label-propagation approach to degenerate feature detection that is efficient enough to be called at each step of an ice sheet simulation, while correctly identifying all degenerate features of an ice sheet mesh. Our method finds all degenerate features in a mesh with 13 million vertices in 0.0561 seconds on 1536 cores in the MPAS Albany Land Ice (MALI) model. Compared to the previously used serial pre-processing approach, we observe a 46,000x speedup for our algorithm, and provide the additional capability of dynamically detecting degenerate features during the simulation.
{"title":"A Parallel Graph Algorithm for Detecting Mesh Singularities in Distributed Memory Ice Sheet Simulations","authors":"Ian Bogle, K. Devine, M. Perego, S. Rajamanickam, George M. Slota","doi":"10.1145/3337821.3337841","DOIUrl":"https://doi.org/10.1145/3337821.3337841","url":null,"PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123983392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
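To make the "disconnected components (icebergs)" part of the abstract above concrete, here is a serial sketch of BFS-style label propagation: start from vertices assumed to be attached to land, propagate a "connected" label along mesh edges, and report whatever is never reached. It covers only the iceberg case, not hinge vertices, and it is not the distributed-memory algorithm the paper describes; the adjacency-dict mesh format and the `grounded` seed set are assumptions made for the example.

```python
from collections import deque


def floating_vertices(adj, grounded):
    """Return the vertices that cannot be reached from any grounded vertex.

    adj: dict mapping each vertex to a list of neighbouring vertices.
    grounded: iterable of vertices assumed to be attached to land.
    """
    seen = set(grounded)
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)       # propagate the "connected" label
                queue.append(w)
    return set(adj) - seen


# Vertices 0-2 form a grounded strip; 3-4 form a detached piece.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(sorted(floating_vertices(adj, grounded=[0])))
```

In a distributed setting each rank would run this propagation on its local part of the mesh and exchange labels for boundary vertices until nothing changes, which is the general shape of label propagation, though the details of the paper's scheme are not stated in the abstract.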