Introduction to RAW 2021
Pub Date: 2021-06-01  DOI: 10.1109/ipdpsw52791.2021.00020
2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Performance Study of Multi-tenant Cloud FPGAs
Joel Mandebi Mbongue, S. Saha, C. Bobda
Pub Date: 2021-06-01  DOI: 10.1109/IPDPSW52791.2021.00032
Cloud deployments increasingly provision FPGA accelerators as part of virtual instances. While commercial clouds still essentially expose single-tenant FPGAs to users, the growing demand for hardware acceleration calls for architectures that support FPGA multi-tenancy. In this work, we explore the trade-off between hardware consolidation and performance. Experiments show that FPGA multi-tenancy increases hardware utilization while degrading I/O performance on the order of microseconds compared to the single-tenant model. They also demonstrate that implementing on-chip communication between the hardware workloads of a cloud user significantly reduces the overall communication overhead.
Evaluating the Performance of Integer Sum Reduction on an Intel GPU
Zheming Jin, J. Vetter
Pub Date: 2021-06-01  DOI: 10.1109/IPDPSW52791.2021.00099
Sum reduction is a primitive operation in parallel computing, and SYCL is a promising heterogeneous programming language. In this paper, we describe SYCL implementations of integer sum reduction using atomic functions, shared local memory, vectorized memory accesses, and parameterized workload sizes. Evaluating the reduction kernels shows that we can achieve a 1.4X speedup over open-source implementations of sum reduction for a sufficiently large number of integers on an Intel integrated GPU.
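The kernel structure this abstract describes can be sketched in plain Python as a hypothetical simulation (not the authors' SYCL code): each simulated work-group performs a tree reduction over its chunk, as it would in shared local memory, and then contributes a single atomic-style add to the global accumulator.

```python
def reduce_sum(data, group_size=256):
    """Simulate a two-stage GPU sum reduction: per-group tree
    reduction, then one global add per group (an atomic_fetch_add
    in a real SYCL kernel)."""
    total = 0  # global accumulator (atomic in a real kernel)
    for start in range(0, len(data), group_size):
        local = list(data[start:start + group_size])
        stride = len(local)
        # Tree reduction: fold the upper half onto the lower half
        # until one partial sum remains for this "work-group".
        while stride > 1:
            half = (stride + 1) // 2
            for i in range(stride - half):
                local[i] += local[i + half]
            stride = half
        total += local[0] if local else 0
    return total
```

The tree shape matters on a GPU because the `stride - half` adds within one step are independent and can run in parallel across work-items; the sequential Python loop only models the data flow.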
RRNS Base Extension Error-Correcting Code for Performance Optimization of Scalable Reliable Distributed Cloud Data Storage
M. Babenko, A. Tchernykh, Luis Bernardo Pulido-Gaytan, J. M. Cortés-Mendoza, Egor Shiryaev, E. Golimblevskaia, A. Avetisyan, S. Nesmachnow
Pub Date: 2021-06-01  DOI: 10.1109/IPDPSW52791.2021.00087
Ensuring reliable data storage in a cloud environment is a challenging problem. One of the efficient mechanisms used to solve it is the Redundant Residue Number System (RRNS) with the projection method, a commonly used mechanism for detecting errors. However, error correction based on the projection method has exponential complexity in the number of control and working moduli. In this paper, we propose an optimization mechanism that uses a base extension and the Hamming distance to reduce the number of calculated projections. We show that this number can be reduced by up to a factor of three compared to the classical projection method, and hence so can the time complexity of data recovery in distributed cloud data storage.
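The classical projection method that the paper optimizes can be illustrated with a small Python sketch (a baseline illustration under assumed toy moduli, not the authors' optimized scheme): a value is stored as residues modulo working plus control moduli, and a corrupted residue is located by dropping one residue at a time and reconstructing from the rest.

```python
from math import prod

def crt(residues, moduli):
    # Chinese Remainder Theorem: reconstruct x from its residues
    # modulo pairwise-coprime moduli.
    M = prod(moduli)
    x = sum(r * (M // m) * pow(M // m, -1, m)
            for r, m in zip(residues, moduli))
    return x % M

def correct_single_error(residues, moduli, n_working):
    # Legitimate values lie below the product of the working moduli;
    # the control moduli only add redundancy.
    legit = prod(moduli[:n_working])
    x = crt(residues, moduli)
    if x < legit:
        return x  # consistent: no error detected
    # Classical projection method: drop one residue at a time and
    # reconstruct from the rest; the projection that lands back in
    # the legitimate range pinpoints and corrects the bad residue.
    for i in range(len(moduli)):
        proj = crt(residues[:i] + residues[i + 1:],
                   moduli[:i] + moduli[i + 1:])
        if proj < legit:
            return proj
    raise ValueError("more errors than the code can correct")
```

The exponential cost the abstract mentions comes from generalizing this: correcting multiple errors means reconstructing projections for every subset of dropped moduli, which the proposed base-extension and Hamming-distance mechanism prunes.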
Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks
Jonas Posner, Lukas Reitz, Claudia Fohry
Pub Date: 2021-06-01  DOI: 10.1109/IPDPSW52791.2021.00089
With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by task-based parallel programming coupled with work stealing. At the task level, resilience can be provided by two principal approaches: checkpointing and supervision. Particular algorithms have recently been worked out for both; they perform local recovery and continue program execution on a reduced set of resources. The checkpointing algorithms regularly save task descriptors explicitly, while the supervision algorithms exploit the natural duplication of task descriptors during work stealing and may be coupled with steal tracking to minimize the number of task re-executions. Thus far, the two groups of algorithms have targeted different task models: checkpointing algorithms address dynamic independent tasks, and supervision algorithms address nested fork-join programs. This paper transfers the most advanced supervision algorithm to the dynamic independent tasks model, thus enabling a comparison between checkpointing and supervision. Our comparison includes experiments and running time predictions. Results consistently show typical resilience overheads below 1% for both approaches. The overheads are lower for supervision in practically relevant cases, but checkpointing takes the lead at on the order of millions of processes.
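The checkpointing side of this comparison can be sketched as a toy Python model (all names here are hypothetical, and real systems checkpoint per worker and merge results during recovery): the descriptors of open tasks are saved explicitly at intervals, so after a simulated failure the run resumes from the last snapshot on the surviving resources instead of restarting.

```python
import pickle

class CheckpointedPool:
    """Toy model of checkpointing for dynamic independent tasks:
    open task descriptors and the partial result are snapshotted
    periodically; recovery reloads the last snapshot."""
    def __init__(self, tasks, every=2):
        self.tasks = list(tasks)   # open task descriptors
        self.result = 0            # reduced partial result
        self.every = every
        self.snapshot = None

    def checkpoint(self):
        self.snapshot = pickle.dumps((self.tasks, self.result))

    def run(self, fn, fail_after=None):
        done = 0
        self.checkpoint()
        while self.tasks:
            if fail_after is not None and done == fail_after:
                raise RuntimeError("simulated permanent node failure")
            task = self.tasks.pop()
            self.result += fn(task)
            done += 1
            if done % self.every == 0:
                self.checkpoint()
        return self.result

    def recover(self):
        # Local recovery: reload the last snapshot and continue on
        # the remaining resources; only tasks processed since the
        # snapshot are re-executed.
        self.tasks, self.result = pickle.loads(self.snapshot)
```

The supervision approach avoids these explicit saves by relying on the copies of task descriptors that work stealing leaves behind on the thief and victim, which is why its overhead is lower until coordination costs dominate at very large process counts.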
Introduction to PAISE 2021
Pub Date: 2021-06-01  DOI: 10.1109/ipdpsw52791.2021.00125
An Auto-tuning with Adaptation of A64 Scalable Vector Extension for SPIRAL
Naruya Kitai, D. Takahashi, F. Franchetti, T. Katagiri, S. Ohshima, Toru Nagai
Pub Date: 2021-06-01  DOI: 10.1109/IPDPSW52791.2021.00117
In this paper, we propose an auto-tuning (AT) system that adapts the A64 Scalable Vector Extension (SVE) for SPIRAL to generate discrete Fourier transform (DFT) implementations. The performance of our method is evaluated on the supercomputer "Flow" at Nagoya University. The SVE-enabled DFT codes are up to 1.98 times faster than scalar DFT codes and achieve an up to 3.63 times higher SIMD instruction rate. In addition, we obtain a maximum speedup factor of 2.32 by applying the proposed AT system to loop unrolling.
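Auto-tuning of a parameter such as the unroll factor follows a simple empirical loop, sketched below in Python (a hypothetical illustration of the general AT pattern, not the SPIRAL-based system itself, which generates and compiles code variants): build each candidate variant, time it, keep the fastest.

```python
import time

def make_unrolled_sum(unroll):
    # Build a summation "variant" whose inner loop consumes
    # `unroll` elements per iteration, with a remainder loop for
    # the leftover elements. A real AT system emits source code
    # with the body literally replicated instead.
    def kernel(data):
        acc, i = 0, 0
        limit = len(data) - len(data) % unroll
        while i < limit:
            for j in range(unroll):      # stands in for the unrolled body
                acc += data[i + j]
            i += unroll
        for k in range(limit, len(data)):  # remainder loop
            acc += data[k]
        return acc
    return kernel

def _time_once(kernel, data):
    start = time.perf_counter()
    kernel(data)
    return time.perf_counter() - start

def autotune(data, factors=(1, 2, 4, 8), trials=3):
    # Empirical auto-tuning: benchmark every candidate variant on
    # representative input and keep the fastest factor.
    best_factor, best_time = factors[0], float("inf")
    for f in factors:
        kernel = make_unrolled_sum(f)
        elapsed = min(_time_once(kernel, data) for _ in range(trials))
        if elapsed < best_time:
            best_factor, best_time = f, elapsed
    return best_factor
```

Taking the minimum over several trials reduces timing noise; production tuners additionally prune the search space rather than timing every variant exhaustively.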
IPDPS 2021 PhD Forum Welcome and Abstracts
Pub Date: 2021-06-01  DOI: 10.1109/ipdpsw52791.2021.00160
Introduction to PDCO 2021
Pub Date: 2021-06-01  DOI: 10.1109/ipdpsw52791.2021.00080
Efficient Memory Management in Likelihood-based Phylogenetic Placement
P. Barbera, A. Stamatakis
Pub Date: 2021-06-01  DOI: 10.1109/IPDPSW52791.2021.00041
Maximum likelihood based phylogenetic methods score phylogenetic tree topologies, each comprising a set of molecular sequences of the species under study, using statistical models of evolution. The scoring procedure relies on storing intermediate results at inner nodes of the tree during the tree traversal. This induces comparatively high memory requirements relative to less compute-intensive methods such as parsimony. The memory requirements are particularly large for maximum likelihood phylogenetic placement, as additional intermediate results must be stored at all branches of the tree to maximize runtime performance. This has hindered numerous users of our phylogenetic placement tool EPA-NG from performing placement on large phylogenetic trees. Here, we present an approach to reduce the memory footprint of EPA-NG. Furthermore, we have generalized our implementation and integrated it into our phylogenetic likelihood library, libpll-2, so that it can be used by other tools for phylogenetic inference. On an empirical dataset, we were able to reduce the memory requirements by up to 96%, at the cost of increasing execution times by a factor of 23. Hence, there exists a trade-off between decreasing memory requirements and increasing execution times, which we investigate. When the amount of memory available for placement is increased to a certain level, execution times grow by only approximately a factor of 4 for the most challenging dataset we have tested. This now allows maximum likelihood based placement to be conducted on substantially larger trees within reasonable times. Finally, we show that the active memory management approach introduces new challenges for parallelization, and we outline possible solutions.
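The memory-for-runtime trade-off at the heart of this approach can be sketched with a toy Python cache (a hypothetical illustration, not the libpll-2 implementation): only a fixed number of per-node intermediate results are kept, and anything evicted is recomputed on demand when a traversal needs it again.

```python
from collections import OrderedDict

class RecomputeCache:
    """Keep at most `slots` intermediate results; recompute evicted
    entries on demand. Toy model of active memory management for
    per-node likelihood vectors."""
    def __init__(self, slots, compute):
        self.slots = slots
        self.compute = compute        # node -> value; may recurse via get()
        self.cache = OrderedDict()    # access order doubles as LRU order
        self.recomputations = 0

    def get(self, node):
        if node in self.cache:
            self.cache.move_to_end(node)    # refresh LRU position
            return self.cache[node]
        self.recomputations += 1
        value = self.compute(node)          # recompute (may recurse)
        if len(self.cache) >= self.slots:
            self.cache.popitem(last=False)  # evict least recently used
        self.cache[node] = value
        return value
```

With generous slots every node is computed once; with few slots the results are identical but repeated queries trigger recomputation, mirroring the 96% memory saving versus up-to-23x slowdown reported above.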