Squeezing performance out of Arkouda
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00119
Elliot Ronaghan
This talk will highlight optimizations made to Arkouda, a Python package backed by Chapel that provides a key subset of the popular NumPy and Pandas interfaces at HPC scales. Optimizations such as aggregating communication have significantly improved Arkouda’s performance across a wide range of architectures. Key optimizations and benchmark results will be shown on architectures including a single node server, Ethernet and InfiniBand clusters, and a 512 node Cray supercomputer.
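For readers unfamiliar with the package, a minimal Arkouda session looks like the sketch below (the host, port, and array size are illustrative, and a running arkouda_server is assumed); the NumPy-like calls execute on the Chapel backend, where communication-heavy operations such as argsort are the ones that benefit from aggregation:

```python
import arkouda as ak

# Attach to a running arkouda_server (host/port are placeholders).
ak.connect("localhost", 5555)

# NumPy-style calls; the data lives on the Chapel server and may be
# distributed across many locales.
a = ak.randint(0, 2**32, 10**8)   # 100M-element integer array
perm = ak.argsort(a)              # communication-bound at scale
print(a[perm][:5])                # smallest five values

ak.disconnect()
```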
{"title":"Squeezing performance out of Arkouda","authors":"Elliot Ronaghan","doi":"10.1109/IPDPSW50202.2020.00119","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00119","url":null,"abstract":"This talk will highlight optimizations made to Arkouda, a Python package backed by Chapel that provides a key subset of the popular NumPy and Pandas interfaces at HPC scales. Optimizations such as aggregating communication have significantly improved Arkouda’s performance across a wide range of architectures. Key optimizations and benchmark results will be shown on architectures including a single node server, Ethernet and InfiniBand clusters, and a 512 node Cray supercomputer.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121600005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Population Count on Intel® CPU, GPU and FPGA
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00081
Zheming Jin, H. Finkel
Population count is a primitive used in many applications. Commodity processors have dedicated instructions for high-performance population count. Motivated by the productivity of high-level synthesis and the importance of population count, in this paper we investigated OpenCL implementations of population count algorithms and evaluated their performance and resource utilization on an FPGA. Based on the results, we selected the most efficient implementation. We then derived a reduction pattern from a representative application of population count. We parallelized the reduction with atomic functions, and optimized it with vectorized memory accesses, tree reduction, and compute-unit duplication. We evaluated the performance of the reduction kernel on an Intel® Xeon® CPU, an Intel® Iris™ Pro integrated GPU, and an FPGA card that features an Intel® Arria® 10 FPGA. When DRAM memory bandwidth is comparable on the three computing platforms, the FPGA achieves the highest kernel performance for large workloads. On the other hand, we describe performance bottlenecks on the FPGA. To make FPGAs more competitive in raw performance with high-performance CPU and GPU platforms, it is important to increase external memory bandwidth, minimize data movement between host and device, and reduce OpenCL runtime overhead on the FPGA.
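As a concrete reference for the kind of algorithm compared in the paper, here is a pure-Python sketch (an illustration, not the paper's OpenCL code) of the classic SWAR bit-trick popcount on a 64-bit word, together with the reduction pattern of summing popcounts over a buffer:

```python
M = (1 << 64) - 1  # emulate 64-bit wraparound for the multiply below

def popcount64(x):
    # Classic SWAR popcount: pairwise counts, then nibble sums,
    # then a multiply to accumulate all bytes into the top byte.
    x -= (x >> 1) & 0x5555555555555555
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    return ((x * 0x0101010101010101) & M) >> 56

# The reduction pattern: total set bits over a buffer of words.
words = [0xFFFF_0000_FFFF_0000, 0x1, 0x8000_0000_0000_0000]
total = sum(popcount64(w) for w in words)
assert total == 32 + 1 + 1
print(total)  # 34
```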
{"title":"Population Count on Intel® CPU, GPU and FPGA","authors":"Zheming Jin, H. Finkel","doi":"10.1109/IPDPSW50202.2020.00081","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00081","url":null,"abstract":"Population count is a primitive used in many applications. Commodity processors have dedicated instructions for achieving high-performance population count. Motivated by the productivity of high-level synthesis and the importance of population count, in this paper we investigated the OpenCL implementations of population count algorithms, and evaluated their performance and resource utilizations on an FPGA. Based on the results, we select the most efficient implementation. Then we derived a reduction pattern from a representative application of population count. We parallelized the reduction with atomic functions, and optimized it with vectorized memory accesses, tree reduction, and compute-unit duplication. We evaluated the performance of the reduction kernel on an InteloXeono CPU and an Intel® IrisTM Pro integrated GPU, and an FPGA card that features an Intel® Arria® 10 FPGA. When DRAM memory bandwidth is comparable on the three computing platforms, the FPGA can achieve the highest kernel performance for large workload. On the other hand, we described performance bottlenecks on the FPGA. To make FPGAs more competitive in raw performance compared to high-performant CPU and GPU platforms, it is important to increase external memory bandwidth, minimize data movement between a host and a device, and reduce OpenCL runtime overhead on an FPGA.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131457300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting dynamic DAG scheduling under memory constraints for shared-memory platforms
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00102
Gabriel Bathie, L. Marchal, Y. Robert, Samuel Thibault
This work focuses on dynamic DAG scheduling under memory constraints. We target a shared-memory platform equipped with p parallel processors. We aim at bounding the maximum amount of memory that may be needed by any schedule using p processors to execute the DAG. We refine the classical model that computes maximum cuts by introducing two types of memory edges in the DAG: black edges for regular precedence constraints and red edges for actual memory consumption during execution. A valid edge cut cannot include more than p red edges. This limitation had not been taken into account in previous work, and it dramatically changes the complexity of the problem, which was polynomial and becomes NP-hard. We introduce an Integer Linear Program (ILP) to solve it, together with an efficient heuristic based on rounding the rational solution of the ILP. In addition, we propose an exact polynomial algorithm for series-parallel graphs. We provide an extensive set of experiments, both with randomly generated graphs and with graphs arising from practical applications, which demonstrate the impact of resource constraints on peak memory usage.
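To make the model concrete, the brute-force sketch below (a toy instance with invented weights, not from the paper) enumerates the topological cuts of a small DAG and reports the maximum memory over cuts containing at most p red edges, which is the quantity the ILP and heuristics bound at scale:

```python
from itertools import combinations

# Toy instance: edges are (src, dst, memory_weight, is_red).
# Black edges carry precedence only; red edges carry memory that is
# live during execution. A valid cut has at most p red edges.
n = 4
edges = [
    (0, 1, 5, False),
    (0, 2, 3, True),
    (1, 3, 4, True),
    (2, 3, 2, False),
]

def downsets():
    # Downward-closed vertex sets correspond to topological cuts.
    for r in range(n + 1):
        for combo in combinations(range(n), r):
            s = set(combo)
            if all(u in s for (u, v, _, _) in edges if v in s):
                yield s

p = 1
peak = 0
for s in downsets():
    cut = [(w, red) for (u, v, w, red) in edges if u in s and v not in s]
    if sum(red for (_, red) in cut) <= p:  # at most p red edges
        peak = max(peak, sum(w for (w, _) in cut))
print("peak memory over valid cuts:", peak)  # 8 for this instance
```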
{"title":"Revisiting dynamic DAG scheduling under memory constraints for shared-memory platforms","authors":"Gabriel Bathie, L. Marchal, Y. Robert, Samuel Thibault","doi":"10.1109/IPDPSW50202.2020.00102","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00102","url":null,"abstract":"This work focuses on dynamic DAG scheduling under memory constraints. We target a shared-memory platform equipped with p parallel processors. We aim at bounding the maximum amount of memory that may be needed by any schedule using p processors to execute the DAG. We refine the classical model that computes maximum cuts by introducing two types of memory edges in the DAG, black edges for regular precedence constraints and red edges for actual memory consumption during execution. A valid edge cut cannot include more than p red edges. This limitation had never been taken into account in previous works, and dramatically changes the complexity of the problem, which was polynomial and becomes NP-hard. We introduce an Integer Linear Program (ILP) to solve it, together with an efficient heuristic based on rounding the rational solution of the ILP. In addition, we propose an exact polynomial algorithm for series-parallel graphs. We provide an extensive set of experiments, both with randomly-generated graphs and with graphs arising form practical applications, which demonstrate the impact of resource constraints on peak memory usage.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134441748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chapel on Accelerators
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00121
Rahul Ghangas, Josh Milthorpe
Chapel’s high-level data-parallel constructs make parallel programming productive for general programmers. This talk introduces the “Chapel on Accelerators” project, which proposes compiler enhancements to extend these data-parallel constructs to hardware accelerators, including GPUs. Previous attempts to extend Chapel to GPUs [1]–[3] have not been successfully integrated, and any such extension needs to maintain portability and consistency with the Chapel design philosophy and implementation.
{"title":"Chapel on Accelerators","authors":"Rahul Ghangas, Josh Milthorpe","doi":"10.1109/IPDPSW50202.2020.00121","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00121","url":null,"abstract":"Chapel’s high level data-parallel constructs make parallel programming productive for general programmers. This talk introduces the “Chapel on Accelerators” project, which proposes compiler enhancements to extend data-parallel constructs to hardware accelerators including GPUs. Previous attempts to extend Chapel to GPUs [1]–[3] have not been successfully integrated, and any such extension needs to maintain portability and consistency with the Chapel design philosophy and implementation.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132976521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SALSA: A Domain Specific Architecture for Sequence Alignment
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00033
Lorenzo Di Tucci, Riyadh Baghdadi, Saman P. Amarasinghe, M. Santambrogio
The explosion of genomic data is fostering research in fields such as personalized medicine and agritech, raising the need for more performant, power-efficient, and easy-to-use architectures. Devices such as GPUs and FPGAs deliver major performance improvements; however, GPUs have notable power consumption, while FPGAs lack programmability. In this paper, we present SALSA, a Domain-Specific Architecture for sequence alignment that is completely configurable and extensible and is based on the RISC-V ISA. SALSA delivers good performance even at 200 MHz, outperforming Rocket, an open-source core, and an Intel Xeon by factors of up to 350x in performance and 790x in power efficiency.
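For context, the core computation in sequence alignment is a dynamic-programming kernel; the sketch below shows a minimal Smith-Waterman scorer in Python as one representative example (the abstract does not name the exact algorithm SALSA implements, so this kernel is an assumption for illustration):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    # H[i][j] = best local-alignment score for prefixes a[:i], b[:j]
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))  # best local score
```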
{"title":"SALSA: A Domain Specific Architecture for Sequence Alignment","authors":"Lorenzo Di Tucci, Riyadh Baghdadi, Saman P. Amarasinghe, M. Santambrogio","doi":"10.1109/IPDPSW50202.2020.00033","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00033","url":null,"abstract":"The explosion of genomic data is fostering research in fields such as personalized medicine and agritech, raising the necessity of providing more performant, power-efficient and easy-to-use architectures. Devices such as GPUs and FPGAs, deliver major performance improvements, however, GPUs present no-table power consumption, while FPGAs lack programmability. In this paper, we present SALSA, a Domain-Specific Architecture for sequence alignment that is completely configurable, extensible and is based on the RISC-V ISA. SALSA delivers good performance even at 200 MHz, outperforming Rocket, an open-source core, and an Intel Xeon by factors up to 350x in performance and 790x in power efficiency.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130904241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear Algebraic Louvain Method in Python
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00050
Tze Meng Low, Daniele G. Spampinato, Scott McMillan, Michel Pelletier
We show that a linear algebraic formulation of the Louvain method for community detection can be derived systematically from the linear algebraic definition of modularity. Using the pygraphblas interface, a high-level Python wrapper for the GraphBLAS C Application Programming Interface (API), we demonstrate that the linear algebraic formulation of the Louvain method can be rapidly implemented.
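As a sketch of that starting point, modularity for a partition can be computed directly from its linear-algebraic definition, Q = trace(S^T B S) / (2m) with B = A - d d^T / (2m); the example below uses plain NumPy and an invented toy graph rather than pygraphblas:

```python
import numpy as np

# Toy graph (an assumption, not from the paper): two triangles
# joined by a single edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

d = A.sum(axis=1)                  # degree vector
two_m = d.sum()                    # 2m = sum of degrees
B = A - np.outer(d, d) / two_m     # modularity matrix
S = np.eye(2)[[0, 0, 0, 1, 1, 1]]  # one-hot: each triangle is a community
Q = np.trace(S.T @ B @ S) / two_m
print(f"Q = {Q:.3f}")              # ~0.357 for this partition
```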
{"title":"Linear Algebraic Louvain Method in Python","authors":"Tze Meng Low, Daniele G. Spampinato, Scott McMillan, Michel Pelletier","doi":"10.1109/IPDPSW50202.2020.00050","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00050","url":null,"abstract":"We show that a linear algebraic formulation of the Louvain method for community detection can be derived systematically from the linear algebraic definition of modularity. Using the pygraphblas interface, a high-level Python wrapper for the GraphBLAS C Application Programming Interface (API), we demonstrate that the linear algebraic formulation of the Louvain method can be rapidly implemented.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133374830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating Towards Larger Deep Learning Models and Datasets – A System Platform View Point
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00169
S. Vinod, M. Naveen, A. K. Patra, Anto Ajay Raj John
Deep Learning (DL) is a rapidly evolving field under the umbrella of Artificial Intelligence (AI), with proven real-world use cases in supervised and unsupervised learning tasks. As the complexity of the learning tasks increases, DL models become deeper or wider, reaching millions of parameters and using larger datasets. Neural networks like AmoebaNet, with 557M parameters, and GPT-2, with 1.5 billion parameters, are recent examples of large models. DL training is generally run on accelerated hardware such as GPUs, TPUs, or FPGAs, which can satisfy the high computational demands of neural network training. But accelerators are limited in their memory capacities, and the larger the model, the more memory is required to train it. Hence, large DL models and large datasets cannot fit into the limited memory available on GPUs. There are techniques designed to overcome this limitation, such as compression, using CPU memory as a data swap, and recomputation within the GPUs, but the efficiency of each of these techniques also depends on the capabilities of the underlying system platform. In this paper we present observations from our study of training large DL models using the data-swap method on different system platforms. The study showcases the characteristics of large models and presents a system viewpoint of large-model training by relating the software techniques to the system platform used underneath. The results show that for training large DL models the communication link between CPU and GPU is critical, and that training performance can be improved by using a platform with a high-bandwidth link for this communication. The results are based on two DL models: 3DUnetCNN for medical image segmentation and DeepLabV3+ for semantic image segmentation.
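A back-of-the-envelope estimate shows why such models outgrow accelerator memory (the fp32 and Adam assumptions below are illustrative, not from the paper):

```python
# Rough training-memory estimate for a 1.5B-parameter model such as
# GPT-2 (fp32 weights + gradients + Adam moments; activations and
# workspace are ignored, so this is a lower bound).
params = 1.5e9
bytes_per_value = 4                 # fp32
weights = params * bytes_per_value  # 6 GB of weights
grads = weights                     # one gradient per weight
adam = 2 * weights                  # first and second moments
total_gib = (weights + grads + adam) / 2**30
print(f"~{total_gib:.1f} GiB before activations")  # ~22.4 GiB, already
# beyond a 16 GiB GPU, hence swapping data to CPU memory
```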
{"title":"Accelerating Towards Larger Deep Learning Models and Datasets – A System Platform View Point","authors":"S. Vinod, M. Naveen, A. K. Patra, Anto Ajay Raj John","doi":"10.1109/IPDPSW50202.2020.00169","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00169","url":null,"abstract":"Deep Learning (DL) is a rapidly evolving field under the umbrella of Artificial Intelligence (AI) with proven real-world use cases in supervised and unsupervised learning tasks. As the complexity of the learning tasks increases, the DL models become deeper or wider with millions of parameters and use larger datasets. Neural networks like AmoebaNet with 557M parameters and GPT-2 with 1.5 billion parameters are some of the recent examples of large models. DL trainings are generally run on accelerated hardware such as GPUs, TPUs or FPGAs which can satisfy the high computational demands of the neural network training. But accelerators are limited in their memory capacities. Larger the models, larger the memory required while training them. Hence, large DL models and large datasets cannot fit into the limited memory available on GPUs. However, there are techniques designed to overcome this limitation like compression, using CPU memory as a data swap, recomputations within the GPUs etc. But the efficiency of each of these techniques also depends on the underneath system platform capabilities. In this paper we present the observations from our study of training large DL models using data swap method on different system platforms. This study showcases the characteristics of large models and presents the system viewpoint of large deep learning model training by studying the relation of the software techniques to the system platform used underneath. The results presented in the paper show that for training large Deep Learning models, communication link between CPU and GPU is critical and the training performance can be improved by using a platform with high bandwidth link for this communication. The results presented are based on two DL models, 3DUnetCNN model for medical image segmentation and DeepLabV3+ model for semantic image segmentation.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132682275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Workshop 6: HIPS High-level Parallel Programming Models and Supportive Environments
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00064
Dong Li, Heike Jagode
The 25th HIPS workshop, a full-day meeting on May 18th at the IEEE IPDPS 2020 conference in New Orleans (now virtual), focuses on high-level programming of multiprocessors, compute clusters, and massively parallel machines. Like previous workshops in the series, which was established in 1996, this event serves as a forum for research in the areas of parallel applications, language design, compilers, runtime systems, and programming tools. It provides a timely forum for scientists and engineers to present the latest ideas and findings in these rapidly changing fields. In our call for papers, we especially encouraged innovative approaches in the areas of emerging programming models for large-scale parallel systems and many-core architectures.
{"title":"Workshop 6: HIPS High-level Parallel Programming Models and Supportive Environments","authors":"Dong Li, Heike Jagode","doi":"10.1109/ipdpsw50202.2020.00064","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00064","url":null,"abstract":"The 25th HIPS workshop, a full-day meeting on May 18th at the IEEE IPDPS 2020 conference in New Orleans (now virtual), focuses on high-level programming of multiprocessors, compute clusters, and massively parallel machines. Like previous workshops in the series, which was established in 1996, this event serves as a forum for research in the areas of parallel applications, language design, compilers, runtime systems, and programming tools. It provides a timely forum for scientists and engineers to present the latest ideas and findings in these rapidly changing fields. In our call for papers, we especially encouraged innovative approaches in the areas of emerging programming models for large-scale parallel systems and many-core architectures.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"67 38","pages":"316-316"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141207606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00099
A. Benoit, Valentin Le Fèvre, P. Raghavan, Y. Robert, Hongyang Sun
This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit the classical problem under the assumption that jobs are subject to transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework in which jobs are known offline and do not fail: in that framework, list scheduling that gives priority to the longest jobs is known to be a 3-approximation when schedules are restricted to shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list scheduling remains a 2-approximation. The paper focuses on the design of several heuristics, some list-based and some shelf-based, along with different priority rules and backfilling strategies. We assess and compare their performance through an extensive set of simulations, using both synthetic jobs and log traces from the Mira supercomputer.
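To illustrate the list-based approach, here is a heavily simplified sketch (jobs reduced to sequential tasks, with an invented per-attempt failure probability; the paper's heuristics handle genuinely parallel jobs): longest jobs get priority, and a job hit by an error is re-executed until it succeeds:

```python
import heapq
import random

def resilient_list_schedule(lengths, p, fail_prob=0.2, seed=1):
    # Simplified resilient list scheduling: longest-job-first priority,
    # re-execution on failure. Not the paper's code; a sketch only.
    rng = random.Random(seed)
    free = [0.0] * p          # min-heap of times processors become free
    heapq.heapify(free)
    for length in sorted(lengths, reverse=True):  # longest-first priority
        start = heapq.heappop(free)               # earliest-free processor
        end = start + length
        while rng.random() < fail_prob:           # silent error: redo
            end += length
        heapq.heappush(free, end)
    return max(free)          # makespan

print(resilient_list_schedule([8, 7, 5, 4, 3, 2], p=2))
```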
{"title":"Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs","authors":"A. Benoit, Valentin Le Fèvre, P. Raghavan, Y. Robert, Hongyang Sun","doi":"10.1109/IPDPSW50202.2020.00099","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00099","url":null,"abstract":"This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit the classical problem while assuming that jobs are subject to transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework where jobs are known offline and do not fail: in the classical framework, list scheduling that gives priority to longest jobs is known to be a 3-approximation when imposing to use shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list scheduling remains a 2-approximation. The paper focuses on the design of several heuristics, some list-based and some shelf-based, along with different priority rules and backfilling strategies. We assess and compare their performance through an extensive set of simulations, using both synthetic jobs and log traces from the Mira supercomputer.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128856237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Message from the HCW Technical Program Committee Chair
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00010
F. Ciorba
Welcome to the 29th International Heterogeneity in Computing Workshop (HCW). Heterogeneity is one of the most important aspects of modern and emerging parallel and distributed computing systems. Exposing and expressing software parallelism as well as efficiently managing and exploiting hardware parallelism in heterogeneous parallel and distributed computing systems represent both challenges and exciting opportunities for advancing scientific discovery and for impactful innovation.
{"title":"Message from the HCW Technical Program Committee Chair","authors":"F. Ciorba","doi":"10.1109/ipdpsw50202.2020.00010","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00010","url":null,"abstract":"Welcome to the 29th International Heterogeneity in Computing Workshop (HCW). Heterogeneity is one of the most important aspects of modern and emerging parallel and distributed computing systems. Exposing and expressing software parallelism as well as efficiently managing and exploiting hardware parallelism in heterogeneous parallel and distributed computing systems represent both challenges and exciting opportunities for advancing scientific discovery and for impactful innovation.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131308734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}