Optimizing Memory Access in TCF Processors with Compute-Update Operations
M. Forsell, J. Roivainen, J. Träff
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00100
The thick control flow (TCF) model is a data-parallel abstraction of the thread model. It merges homogeneous threads (called fibers) flowing through the same control path into entities (called TCFs) with a single control flow and multiple data flows. The fibers of a TCF execute synchronously with respect to each other, and their number can be altered dynamically at runtime. Multiple TCFs can be executed in parallel to support control parallelism. In our previous work, we outlined a special architecture, TPA (Thick control flow Processor Architecture), for executing TCF programs efficiently and showed that designing algorithms with the TCF model often leads to increased performance and simpler programs, thanks to higher abstraction and the elimination of loops and redundant program elements. Compute-update memory operations, such as multioperations and atomic instructions, are known to speed up parallel algorithms that perform reductions and synchronizations. In this paper, we propose special compute-update memory operations for TCF processors to optimize iterative exclusive inter-fiber memory access patterns. Acceleration is achieved, e.g., in matrix addition and in log-prefix-style patterns, in which multiple target locations can interchange data without the reloads between instructions that slow down execution. Our solution is based on modified active memory units and special memory operations that can send their reply value to a fiber other than the one initiating the access. We implement these operations in our TPA processor at minimal hardware cost and show that the expected speedups are achieved. Programming examples are given.
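The log-prefix-style access pattern mentioned in the abstract above can be illustrated with a sequential Python sketch. This is only a stand-in for the data-exchange pattern between target locations (array elements standing in for the fibers' memory targets); it does not model the TPA instruction set or its compute-update operations, which are described in the paper itself.

```python
def log_prefix_sum(values):
    """Hillis-Steele style inclusive prefix sum in O(log n) steps.

    In a TCF setting, each element would be updated by a distinct fiber.
    A compute-update memory operation could fuse the read of a[i - d]
    with the update of a[i], avoiding the reload between instructions
    that the abstract identifies as a slowdown. (Sequential stand-in;
    the actual TPA operations are not modeled here.)
    """
    a = list(values)
    n = len(a)
    d = 1
    while d < n:
        # Each step combines the local value with one produced d
        # positions away -- an exclusive inter-fiber access pattern
        # when the elements belong to different fibers.
        a = [a[i] + a[i - d] if i >= d else a[i] for i in range(n)]
        d *= 2
    return a

print(log_prefix_sum([1, 2, 3, 4, 5]))  # -> [1, 3, 6, 10, 15]
```

After log2(n) rounds, element i holds the sum of elements 0..i; each round touches every element once, which is why eliminating the per-round reload matters.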
Data Parallel Large Sparse Deep Neural Network on GPU
Naw Safrin Sattar, Shaikh Anfuzzaman
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00170
Sparse deep neural networks (DNNs) are an emerging research area, since deploying deep neural networks with limited resources is very challenging. In this work, we provide a scalable solution to the Sparse DNN Challenge, posed by MIT/IEEE/Amazon GraphChallenge.org, by designing data parallelism on GPUs. We provide a solution based on Python TensorFlow, as it is a widely used tool for deep learning in many scientific applications. We use the datasets provided by GraphChallenge, derived from the MNIST handwritten digits, and the synthetic DNNs from RadiX-Net with varying numbers of neurons and layers. We implement a data-parallel version of the sparse DNN using TensorFlow on GPU. Our solution shows up to 4.7× speedup over the baseline serial MATLAB implementation given in GraphChallenge. In addition, our TensorFlow GPU implementation demonstrates a 3-fold speedup over our TensorFlow CPU implementation.
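The per-layer computation in the Sparse DNN Challenge is a sparse matrix product followed by a bias and a ReLU. A minimal pure-Python sketch of one such layer is shown below, using `{(row, col): value}` dicts as a stand-in for the sparse tensors the paper handles with TensorFlow on GPU; the dict representation, function name, and the convention of adding the bias only to entries produced by the product are illustrative simplifications, not the authors' code.

```python
def sparse_dnn_layer(Y, W, bias):
    """One sparse DNN inference layer: Z = ReLU(Y @ W + bias).

    Y (activations) and W (weights) are sparse matrices stored as
    {(row, col): value} dicts. The ReLU drops non-positive entries,
    keeping the activations sparse from layer to layer -- the property
    that makes sparse kernels pay off in this challenge.
    """
    Z = {}
    for (i, k), y in Y.items():
        for (kk, j), w in W.items():
            if k == kk:  # inner dimensions match: accumulate product
                Z[(i, j)] = Z.get((i, j), 0.0) + y * w
    # Add the bias to the produced entries, then apply ReLU (keep > 0).
    return {(i, j): v + bias for (i, j), v in Z.items() if v + bias > 0}

# Tiny example: one input row, two neurons; the second neuron's
# negative weight is zeroed out by the ReLU.
Y = {(0, 0): 1.0, (0, 1): 2.0}
W = {(0, 0): 0.5, (1, 1): -3.0}
print(sparse_dnn_layer(Y, W, -0.25))  # -> {(0, 0): 0.25}
```

In the paper's setting this kernel would be expressed with TensorFlow sparse operations and replicated across GPU devices for data parallelism; the dict version above only pins down the arithmetic.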
Machine Learning-Based Prefetching for SCM Main Memory System
Mayuko Koezuka, Yusuke Shirota, S. Shirai, Tatsunori Kanai
Pub Date: 2020-05-01  DOI: 10.1109/ipdpsw50202.2020.00133
Demand for in-memory processing of large-scale data is expanding, and expectations for storage-class memories (SCMs) are increasing accordingly. SCM achieves lower standby power and higher density than DRAM; however, it is slower than DRAM and requires more dynamic power. It is therefore necessary to improve speed and reduce the power usage of SCM through memory-hierarchy control, such as power-efficient prefetch control tailored to an application's memory access characteristics. Such control is complicated, however, making it difficult to determine an optimal memory control policy. We therefore propose an auto-tuning framework that dynamically predicts optimal memory control for an SCM main memory system, using machine learning on system-level time-series performance data. In this paper, we describe the application of the proposed framework to prefetch control and evaluate the feasibility of power-efficient prefetch control. The results confirm the automatic generation of prediction models that reflect domain knowledge of computer systems, enabling high-speed, low-power, real-time memory control.
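The shape of the prediction problem in the abstract above — learn a mapping from windows of system-level time-series performance data to a memory-control decision — can be sketched with a deliberately tiny nearest-centroid classifier. Everything here is hypothetical: the feature choice, the "cache-miss count" input, the prefetch labels, and the classifier are illustrative stand-ins, not the paper's framework or its models.

```python
from statistics import mean

def extract_features(window):
    """Summarize a window of per-interval miss counts (hypothetical
    system-level time-series data) as (average level, variability)."""
    return (mean(window), max(window) - min(window))

def train_centroids(labeled_windows):
    """Mean feature vector per label: a minimal stand-in for the
    learned prediction model in an auto-tuning framework."""
    sums = {}
    for label, window in labeled_windows:
        f = extract_features(window)
        s = sums.setdefault(label, [0.0, 0.0, 0])
        s[0] += f[0]; s[1] += f[1]; s[2] += 1
    return {lab: (s[0] / s[2], s[1] / s[2]) for lab, s in sums.items()}

def predict_prefetch(centroids, window):
    """Choose the prefetch setting whose centroid is nearest
    (squared Euclidean distance) to the current window's features."""
    f = extract_features(window)
    return min(centroids,
               key=lambda lab: (centroids[lab][0] - f[0]) ** 2
                             + (centroids[lab][1] - f[1]) ** 2)

# Entirely synthetic training data: steady streaming access benefits
# from aggressive prefetch; irregular access does not.
data = [("aggressive", [90, 95, 92, 94]), ("off", [5, 60, 8, 55])]
model = train_centroids(data)
print(predict_prefetch(model, [88, 93, 90, 91]))  # -> "aggressive"
```

A real deployment would replace the synthetic labels with measured performance/power outcomes per control setting and a stronger model, but the train-then-predict-per-window loop is the part this sketch pins down.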
Teaching Cloud Computing: Motivations, Challenges and Tools
C. Anglano, M. Canonico, Marco Guazzone
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00062
Teaching cloud computing is becoming crucial, since this recent computing paradigm is used in many fields and is changing the way we use applications and technology. As a matter of fact, most of the applications we use every day through the web are based on cloud services. Unfortunately, the difficulty of setting up a real testbed for students and, at the same time, the lack of easy, open, and collaborative educational material freely available make teaching cloud computing a hard task. In this paper, we discuss the state of the art in teaching cloud computing, and we propose educational materials and tools that make cloud computing easy to use even for students and educators without any computer science background.
Workshop 14: iWAPT Automatic Performance Tuning
I. Chung, K. Komatsu
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00132
iWAPT (International Workshop on Automatic Performance Tuning) is a series of workshops focusing on research and techniques related to performance sustainability. The series provides an opportunity for researchers and users of automatic performance tuning (AT) technologies to exchange ideas and experiences gained when applying such technologies to improve the performance of algorithms, libraries, and applications, in particular on cutting-edge computing platforms. Topics of interest include performance modeling; adaptive algorithms; autotuned numerical algorithms, libraries, and scientific applications; empirical compilation; automated code generation; frameworks and theories of AT and software optimization; autonomic computing; and context-aware computing.
Exploring Chapel Productivity Using Some Graph Algorithms
R. Barrett, Jeanine E. Cook, Stephen L. Olivier, O. Aaziz, Chris Jenkins, C. Vaughan
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00114
A broad set of data science and engineering questions may be organized as graphs, providing a powerful means of describing relational data. Although experts now routinely compute graph algorithms on huge, unstructured graphs using high performance computing (HPC) or cloud resources, this practice has not yet broken into the mainstream. Such computations require great expertise, yet users often need rapid prototyping and development to quickly customize existing code. Toward that end, we are exploring the use of the Chapel programming language as a means of making some important graph analytics more accessible, examining the breadth of characteristics that make for a productive programming environment: one that is expressive, performant, portable, and robust.
SpiderWeb - High Performance FPGA NoC
M. Langhammer, Gregg Baeckler, Sergey Gribok
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00025
In this paper we introduce SpiderWeb, a new methodology for building high-speed soft networks on FPGAs. There are many reasons why greater internal bandwidth is an increasingly important issue for FPGAs. Compute density on FPGAs is growing rapidly, from historical precisions such as single-precision floating point to the massively parallel low-precision operations required by machine learning inference. It is difficult for current FPGA fabrics, with designs developed using standard methods and tool flows, to provide a reliable way of generating wide and/or high-speed data distribution busses. In contrast, SpiderWeb uses a specific NoC generation methodology that provides predictable area and performance for these structures, with area and speed accurately known before compile time. The generated NoCs can be incorporated into large, complex designs, implemented with standard design flows, without compromising the routability of the system.
Workshop 13: PDSEC Parallel and Distributed Scientific and Engineering Computing
R. Couturier, P. Strazdins, E. Aubanel, S. Roller, L. Yang, T. Rauber, G. Rünger
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00122
Technological trends in HPC system evolution indicate an increasing burden on application developers, who must manage unprecedented levels of hardware complexity and the associated performance characteristics. Many existing scientific application codes are unlikely to perform well on future systems without major modifications or even complete rewrites. In the future, it will be necessary to exploit, in concert, multiple levels of parallelism, many lightweight cores, complex memory hierarchies, novel I/O technology, power capping, system-wide temporal/spatial performance heterogeneity, and reliability concerns. The parallel and distributed computing (PDC) community has developed new programming models, algorithms, libraries, and tools to meet these challenges and to support productive code development and effective system use. However, the scientific application community still needs to establish these benefits through practical evaluations. Thus, the focus of this workshop is on methodologies and experiences in scientific and engineering applications and algorithms for achieving sustainable code development, with better productivity, application performance, and reliability.
Scalable Deep Learning Inference: Algorithmic Approach
Minsik Cho
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00166
Large-scale deep learning training has made significant progress in the last few years: more powerful systems and accelerators have been delivered (e.g., the Summit cluster), innovative training mechanisms have been designed (e.g., sophisticated hyper-parameter tuning), and advanced communication techniques have been exercised (e.g., async-SGD). However, deep learning inference has rather limited options when it comes to scaling up the model density per device. Quantization to lower precision can help, along with sparsification techniques such as pruning and compression, yet both are constrained by the underlying hardware architecture and their efficacy.
In-Depth Optimization with the OpenACC-to-FPGA Framework on an Arria 10 FPGA
Jacob Lambert, Seyong Lee, J. Vetter, A. Malony
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00084
The reconfigurable computing paradigm based on field-programmable gate arrays (FPGAs) has received renewed interest in the high-performance computing field due to FPGAs' unique combination of performance and energy efficiency. However, difficulties in programming and optimizing FPGAs have prevented them from being widely accepted as general-purpose computing devices. In accelerator-based heterogeneous computing, portability across diverse heterogeneous devices is also an important issue, but the unique architectural features of FPGAs make this difficult to achieve. To address these issues, a directive-based, high-level FPGA programming and optimization framework was previously developed. In this work, the developed optimizations were combined holistically using the directive-based approach, showing that each benchmark requires a unique set of optimizations to maximize performance. The relationships between FPGA resource usage and runtime performance were also explored.