QoS-Driven Coordinated Management of Resources to Save Energy in Multi-core Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00040
M. Nejat, M. Pericàs, P. Stenström
Applications that run on multicore systems without performance targets can waste significant energy. This paper presents, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache (LLC) partitions and the per-core voltage-frequency settings to save energy while respecting the QoS requirements of individual applications in multi-programmed workloads on multi-core systems. It does so by exploring the configuration space of LLC partition sizes and DVFS settings at runtime, at negligible overhead. We show that our coordinated RMA saves, on average, 20% energy, compared to 15% for DVFS alone and 7% for cache partitioning alone, when the performance target is set to 70% of the baseline system performance.
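The abstract does not spell out the exploration step; below is a minimal, hypothetical sketch of how a per-core controller might pick the lowest-energy (LLC ways, DVFS level) configuration that still meets a performance target. The model callbacks and parameter names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: exhaustive configuration-space search per core.
# perf_model / energy_model stand in for the runtime's predictors.
def select_config(perf_model, energy_model, target_perf,
                  llc_way_options, dvfs_levels):
    """Return the lowest-energy (ways, freq) pair that meets target_perf."""
    best = None
    for ways in llc_way_options:
        for freq in dvfs_levels:
            perf = perf_model(ways, freq)        # predicted performance
            if perf < target_perf:               # QoS constraint violated
                continue
            energy = energy_model(ways, freq)    # predicted energy
            if best is None or energy < best[0]:
                best = (energy, ways, freq)
    return best                                  # None if target is infeasible
```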
Network Size Estimation in Small-World Networks Under Byzantine Faults
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00094
Soumyottam Chatterjee, Gopal Pandurangan, Peter Robinson
We study the fundamental problem of counting the number of nodes in a sparse network (of unknown size) in the presence of a large number of Byzantine nodes. We assume the full information model, where the Byzantine nodes have complete knowledge of the entire state of the network at every round (including the random choices made by all the nodes), have unbounded computational power, and can deviate arbitrarily from the protocol. Essentially all known algorithms for fundamental Byzantine problems (e.g., agreement, leader election, sampling) studied in the literature assume knowledge (or at least an estimate) of the size of the network. It is non-trivial to design algorithms for Byzantine problems that work without knowledge of the network size, especially in bounded-degree (expander) networks where the local views of all nodes are (essentially) the same and limited, and Byzantine nodes can quite easily fake the presence/absence of non-existing nodes. To design truly local algorithms that do not rely on any global knowledge (including network size), estimating the size of the network under Byzantine nodes is an important first step. Our main contribution is a randomized distributed algorithm that estimates the size of a network in the presence of a large number of Byzantine nodes. In particular, our algorithm estimates the size of a sparse, "small-world", expander network with up to O(n^(1-Δ)) Byzantine nodes, where n is the (unknown) network size and Δ > 0 can be any arbitrarily small (but fixed) constant. Our algorithm outputs a (fixed) constant-factor estimate of log(n) with high probability; the correct estimate of the network size will be known to a (1 - ε)-fraction of the honest nodes, for any fixed positive constant ε. Our algorithm is fully distributed, lightweight, and simple to implement, runs in O(log^3 n) rounds, requires nodes to send and receive only small-sized messages per round, and keeps any node's local computation cost per round small.
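The Byzantine-resilient algorithm itself is not reproduced in the abstract. As a toy, fault-free illustration of why local randomness alone can reveal log(n), the classic order-statistics estimator below uses the fact that the minimum of n uniform draws concentrates around 1/n, so -log2(min) matches log2(n) up to a small additive constant. This is a standard textbook trick, plainly not the paper's algorithm.

```python
# Toy, fault-free illustration (NOT the paper's Byzantine-resilient
# algorithm): the minimum of n uniform draws is about 1/n, so
# -log2(min) estimates log2(n) up to a small additive bias (~0.83).
import math
import random

def estimate_log_n(n, trials=64):
    estimates = []
    for _ in range(trials):
        smallest = min(random.random() for _ in range(n))
        estimates.append(math.log2(1.0 / smallest))
    return sum(estimates) / len(estimates)

random.seed(0)
print(estimate_log_n(1 << 10))   # roughly 10.8 for n = 1024
```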
Message from the Program Chair and Vice Chairs
DOI: 10.1109/PERCOM.2005.25
Christian Becker, A. Dey, F. Lau, Gergely Záruba (Vice-Chairs)
We are pleased to announce an excellent technical program for the 6th International Conference on Pervasive Computing and Communications. The program covers a broad cross-section of topics in pervasive computing and communications. This year, 160 papers were submitted to the program committee for consideration. The selection process was therefore highly competitive, and the result is a program of high-quality papers.
Tight & Simple Load Balancing
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00080
P. Berenbrink, Tom Friedetzky, Dominik Kaaser, Peter Kling
We consider the following load balancing process for m tokens distributed arbitrarily among n nodes connected by a complete graph. In each time step a pair of nodes is selected uniformly at random. Let ℓ_1 and ℓ_2 be their respective numbers of tokens. The two nodes exchange tokens such that they have ⌈(ℓ_1 + ℓ_2)/2⌉ and ⌊(ℓ_1 + ℓ_2)/2⌋ tokens, respectively. We provide a simple analysis showing that this process reaches almost perfect balance within O(n log n + n log Δ) steps with high probability, where Δ is the maximal initial load difference between any two nodes. This bound is asymptotically tight.
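The process is simple enough to simulate directly; the sketch below (with assumed parameters) runs the pairwise averaging step and reports the remaining load discrepancy after roughly n(log n + log Δ) steps.

```python
# Direct simulation of the process above: pick a random pair, split their
# combined load into ceil and floor halves. Parameters are illustrative.
import random

def balance(tokens, steps):
    n = len(tokens)
    for _ in range(steps):
        i, j = random.sample(range(n), 2)  # pair chosen uniformly at random
        total = tokens[i] + tokens[j]
        tokens[i] = (total + 1) // 2       # ceil(total / 2)
        tokens[j] = total // 2             # floor(total / 2)
    return max(tokens) - min(tokens)       # remaining discrepancy

random.seed(1)
load = [random.randint(0, 1000) for _ in range(100)]
print(balance(load, 2000))                 # ~ n (log n + log Δ) steps
```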
A Scalable Clustering-Based Task Scheduler for Homogeneous Processors Using DAG Partitioning
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00026
M. Özkaya, A. Benoit, B. Uçar, J. Herrmann, Ümit V. Çatalyürek
When scheduling a directed acyclic graph (DAG) of tasks with communication costs on computational platforms, a good trade-off between load balance and data locality is necessary. List-based scheduling techniques are commonly used greedy approaches for this problem. The downside of list-scheduling heuristics is that they are incapable of making short-term sacrifices for the global efficiency of the schedule. In this work, we describe new list-based scheduling heuristics based on clustering for homogeneous platforms, under the realistic duplex single-port communication model. Our approach uses an acyclic DAG partitioner for clustering. The clustering enhances the data locality of the scheduler by giving it a global view of the graph. Furthermore, since the partition is acyclic, we can schedule each part completely once its input tasks are ready to be executed. We present an extensive experimental evaluation showing the trade-offs between the granularity of clustering and the parallelism, and how this affects the scheduling. Furthermore, we compare our heuristics to the best state-of-the-art list-scheduling and clustering heuristics, and obtain makespans more than three times better in cases with many communications.
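A hedged sketch of the overall flow, not the paper's implementation: the acyclic partitioner is treated as a given task-to-part mapping, the parts form a coarse DAG, and each part is scheduled as one unit on the earliest-free processor once its predecessor parts have finished. Communication costs and the part-cost model are assumed placeholders.

```python
# Sketch of clustering-based scheduling over an acyclic partition.
# task_succ: {task: [successor tasks]}; part_of: task -> part id;
# part_cost: part id -> execution time. Communication costs omitted.
from collections import defaultdict

def schedule_parts(task_succ, part_of, part_cost, num_procs):
    pred = defaultdict(set)                   # part-level predecessor sets
    for t, succs in task_succ.items():
        for s in succs:
            a, b = part_of[t], part_of[s]
            if a != b:
                pred[b].add(a)
    remaining = set(part_of.values())
    finish = {}                               # part -> finish time
    procs = [0.0] * num_procs                 # next free time per processor
    while remaining:                          # partition is acyclic, so this terminates
        # A part is ready when all its predecessor parts have finished.
        ready = [p for p in remaining if pred[p] <= finish.keys()]
        for p in sorted(ready):
            i = min(range(num_procs), key=procs.__getitem__)
            start = max([procs[i]] + [finish[q] for q in pred[p]])
            finish[p] = start + part_cost[p]
            procs[i] = finish[p]
            remaining.remove(p)
    return max(finish.values())               # makespan
```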
Dynamic Memory Management for GPU-Based Training of Deep Neural Networks
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00030
B. ShriramS, Anshuj Garg, Purushottam Kulkarni
Deep learning has been widely adopted for different applications of artificial intelligence: speech recognition, natural language processing, computer vision, etc. The growing size of Deep Neural Networks (DNNs) has compelled researchers to design memory-efficient and performance-optimal algorithms. Apart from algorithmic improvements, specialized hardware like Graphics Processing Units (GPUs) is widely employed to accelerate the training and inference phases of deep networks. However, limited GPU memory capacity bounds the size of the networks that can be offloaded to and trained on GPUs. vDNN addresses this GPU memory bottleneck and enables training of deep networks that are larger than GPU memory. In our work, we characterize and identify multiple bottlenecks in vDNN, such as delayed computation start, high pinned-memory requirements, and GPU memory fragmentation. We present vDNN++, which extends vDNN and resolves the identified issues. Our results show that the performance of vDNN++ is comparable to or better than vDNN (up to 60% relative improvement). We propose different heuristics and orderings for memory allocation, and empirically evaluate the extent of memory fragmentation under each. We also reduce the pinned-memory requirement by up to 60%.
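vDNN's core idea, which vDNN++ builds on, is to keep only a window of layer activations on the GPU, offloading the rest to pinned host memory and prefetching them back before the corresponding backward step. The toy bookkeeping below illustrates that policy only; the budget, layer ids, and synchronous "copies" are stand-ins for real asynchronous GPU transfers.

```python
# Toy bookkeeping for a vDNN-style offload policy (illustration only; real
# systems overlap asynchronous copies with compute on separate streams).
class OffloadManager:
    def __init__(self, budget):
        self.budget = budget      # max layers whose activations stay on GPU
        self.on_gpu = []          # layer ids resident on the GPU
        self.offloaded = set()    # layer ids parked in pinned host memory

    def after_forward(self, layer_id):
        self.on_gpu.append(layer_id)
        if len(self.on_gpu) > self.budget:
            victim = self.on_gpu.pop(0)   # evict the oldest activations
            self.offloaded.add(victim)    # (async device-to-host copy)

    def before_backward(self, layer_id):
        if layer_id in self.offloaded:    # prefetch ahead of the gradient op
            self.offloaded.remove(layer_id)
            self.on_gpu.append(layer_id)  # (async host-to-device copy)
```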
LLC-Guided Data Migration in Hybrid Memory Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00101
E. Vasilakis, Vassilis D. Papaefstathiou, P. Trancoso, I. Sourdis
Although 3D-stacked DRAM offers substantially higher bandwidth than commodity DDR DIMMs, it cannot yet provide the necessary capacity to replace the bulk of the memory. A promising alternative is a flat-address-space hybrid memory system of two or more levels, each exhibiting different performance characteristics. One such existing approach employs a near, high-bandwidth 3D-stacked memory, placed on top of the processor die, combined with a far, commodity DDR memory, placed off-chip. Migrating data from the far to the near memory has significant performance potential, but also entails overheads, which may diminish migration benefits or even lead to performance degradation. This paper describes a new data migration scheme for hybrid memory systems that takes these overheads into account and improves migration efficiency and effectiveness. It is based on the observation that migrating memory segments which are (at least partly) present in the Last-Level Cache (LLC) generates less migration traffic. Our approach relies on the state of the LLC cachelines to predict future reuse and select memory segments for migration. Thereby, segments are migrated while present (at least partly) in the LLC, incurring lower cost. Our experiments confirm that our approach outperforms current state-of-the-art migration designs, improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.
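The traffic argument can be made concrete with a small cost model: cachelines of a segment that already sit in the LLC do not have to be re-read from far memory during migration. The sketch below is an assumed illustration of that selection rule, not the paper's hardware mechanism; the Segment fields and the traffic budget are hypothetical.

```python
# Assumed illustration of LLC-guided candidate selection: segments with more
# LLC-resident lines cost less far-memory traffic to migrate.
from collections import namedtuple

Segment = namedtuple("Segment", "seg_id total_lines llc_lines")

def migration_cost(seg, line_bytes=64):
    # Only lines absent from the LLC must be fetched from far memory.
    return (seg.total_lines - seg.llc_lines) * line_bytes

def pick_candidates(segments, budget_bytes):
    chosen, spent = [], 0
    for seg in sorted(segments, key=migration_cost):  # cheapest first
        cost = migration_cost(seg)
        if spent + cost > budget_bytes:
            break
        chosen.append(seg)
        spent += cost
    return chosen

demo = [Segment(0, 32, 30), Segment(1, 32, 2), Segment(2, 32, 16)]
print([s.seg_id for s in pick_candidates(demo, budget_bytes=2048)])  # [0, 2]
```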
BigSpa: An Efficient Interprocedural Static Analysis Engine in the Cloud
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00086
Zhiqiang Zuo, Rong Gu, Xi Jiang, Zhaokang Wang, Yihua Huang, Linzhang Wang, Xuandong Li
Static program analysis is widely used in various application areas to solve many practical problems. Although researchers have made significant achievements in static analysis, it remains challenging to perform sophisticated interprocedural analysis on large-scale modern software. The underlying reason is that such interprocedural analysis is highly computation- and memory-intensive, leading to poor scalability. We aim to tackle the scalability problem by proposing a novel big-data solution for sophisticated static analysis. Specifically, we propose a data-parallel algorithm and a join-process-filter computation model for CFL-reachability-based interprocedural analysis, and develop an efficient distributed static analysis engine in the cloud, called BigSpa. Our experiments validate that BigSpa running on a cluster scales well, performs precise interprocedural analyses on millions of lines of code, and runs an order of magnitude or more faster than existing state-of-the-art analysis tools.
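CFL-reachability, the computation BigSpa distributes, closes a labeled edge set under the productions of a context-free grammar. A minimal single-machine worklist version is sketched below under assumed input formats; BigSpa's join-process-filter model partitions and batches this same closure, which the sketch does not attempt to show.

```python
# Minimal single-machine CFL-reachability closure. Assumed input format:
# edges are (u, label, v); productions are (A, B) for A -> B, or
# (A, B, C) for A -> B C.
from collections import defaultdict

def cfl_reachability(edges, productions):
    closed = set(edges)
    work = list(closed)
    out_edges = defaultdict(set)   # u -> {(label, v)} over processed edges
    in_edges = defaultdict(set)    # v -> {(label, u)} over processed edges

    def add(edge):
        if edge not in closed:
            closed.add(edge)
            work.append(edge)

    while work:
        u, b, v = work.pop()
        out_edges[u].add((b, v))
        in_edges[v].add((b, u))
        for prod in productions:
            if len(prod) == 2:                       # A -> B
                a, lhs = prod
                if lhs == b:
                    add((u, a, v))
            else:                                    # A -> B C
                a, left, right = prod
                if left == b:                        # edge plays the B role
                    for lbl, w in out_edges[v]:
                        if lbl == right:
                            add((u, a, w))
                if right == b:                       # edge plays the C role
                    for lbl, t in in_edges[u]:
                        if lbl == left:
                            add((t, a, v))
    return closed

# Tiny demo: transitive closure expressed as the grammar T -> e | T T.
print(cfl_reachability({("x", "e", "y"), ("y", "e", "z")},
                       [("T", "e"), ("T", "T", "T")]))
```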
SAC Goes Cluster: Fully Implicit Distributed Computing
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00107
Thomas Macht, C. Grelck
SAC (Single Assignment C) is a purely functional, data-parallel array programming language that predominantly targets compute-intensive applications. Thus, clusters of workstations, or distributed memory architectures in general, form highly relevant compilation targets. Notwithstanding, SAC as of today only supports shared-memory architectures, graphics accelerators, and heterogeneous combinations thereof. In our current work we aim at closing this gap. At the same time, we are determined to uphold SAC's promise of entirely compiler-directed exploitation of concurrency, no matter what the target architecture is. Distributed memory architectures are going to make this promise a particular challenge. Despite SAC's functional semantics, it is generally far from straightforward to infer exact communication patterns from architecture-agnostic code. Therefore, we intend to capitalise on recent advances in network technology, namely the closing of the gap between memory bandwidth and network bandwidth. We aim at a solution based on a custom-designed software distributed shared memory (S-DSM) and large per-node software-managed cache memories. To this effect, the functional nature of SAC, with its write-once/read-only arrays, provides a strategic advantage that we thoroughly exploit. Throughout the paper we further motivate our approach, sketch out our implementation strategy, show preliminary results, and discuss the pros and cons of our approach.
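The strategic advantage of write-once/read-only arrays is easy to see in miniature: a node-local software cache for such arrays never needs an invalidation protocol, because a fetched copy can never become stale. The class below is an assumed toy model of that property, not SAC's actual S-DSM runtime; fetch_remote is a hypothetical callback.

```python
# Toy model (assumed, not SAC's runtime): caching blocks of write-once
# arrays needs no coherence traffic, since cached copies never go stale.
class NodeCache:
    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote   # (array_id, block) -> bytes
        self.blocks = {}                   # (array_id, block) -> bytes

    def read(self, array_id, block):
        key = (array_id, block)
        if key not in self.blocks:         # miss: fetch once over the network
            self.blocks[key] = self.fetch_remote(array_id, block)
        return self.blocks[key]            # hit: purely local, always valid
```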
The Path to Delivering Programmable Exascale Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00081
L. DeRose
The trends in hardware architecture are paving the road towards Exascale. However, these trends are also increasing the complexity of designing and developing the software developer environment deployed on modern supercomputers. Moreover, the scale and complexity of high-end systems create a new set of challenges for application developers. Computational scientists are facing system characteristics that will significantly impact the programmability and scalability of applications. In order to address these issues, software architects need to take a holistic view of the entire system and deliver a high-level programming environment that can help maximize programmability while not losing sight of performance portability. In this talk, I will discuss the current trends in computer architecture and their implications for application development, and will present Cray's high-level parallel programming environment for performance and programmability on current and future supercomputers. I will also discuss some of the challenges and open research problems that need to be addressed in order to build a software developer environment for extreme-scale systems that helps users solve multi-disciplinary and multi-scale problems with high levels of performance, programmability, and scalability.