We address the problem of tuning the performance of the Java Virtual Machine (JVM) with run-time flags (parameters). We use the HotSpot JVM in our study. As the HotSpot JVM comes with over 600 flags to choose from, selecting a subset manually to maximize performance is infeasible. In prior work, the potential performance improvement is limited by the fact that only a subset of the tunable flags is tuned. We adopt a different approach and present the HotSpot Auto-tuner, which considers the entire JVM and the effect of all the flags. To the best of our knowledge, ours is the first auto-tuner for optimizing the performance of the JVM as a whole. We organize the JVM flags into a tree structure by building a flag hierarchy, which helps us resolve dependencies among aspects of the JVM such as garbage-collection algorithms and JIT compilation, and helps to reduce the configuration search space. Experiments with the SPECjvm2008 and DaCapo benchmarks show that we could optimize the HotSpot JVM with significant speedup: 16 SPECjvm2008 startup programs improved by an average of 19%, with three of them improving dramatically by 63%, 51%, and 32%, within a maximum tuning time of 200 minutes each. With a minimum tuning time of 200 minutes, the average performance improvement for 13 DaCapo benchmark programs is 26%, with 42% being the maximum improvement.
{"title":"Auto-Tuning the Java Virtual Machine","authors":"Sanath Jayasena, Milinda Fernando, Tharindu Rusira Patabandi, Chalitha Perera, C. Philips","doi":"10.1109/IPDPSW.2015.84","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.84","url":null,"abstract":"We address the problem of tuning the performance of the Java Virtual Machine (JVM) with run-time flags (parameters). We use the Hot Spot JVM in our study. As the Hot Spot JVM comes with over 600 flags to choose from, selecting a subset manually to maximize performance is infeasible. In prior work, the potential performance improvement is limited by the fact that only a subset of the tunable flags are tuned. We adopt a different approach and present the Hot Spot Auto-tuner which considers the entire JVM and the effect of all the flags. To the best of our knowledge, ours is the first auto-tuner for optimizing the performance of the JVM as a whole. We organize the JVM flags into a tree structure by building a flag-hierarchy, which helps us to resolve dependencies on aspects of the JVM such as garbage collector algorithms and JIT compilation, and helps to reduce the configuration search-space. Experiments with the SPECjvm2008 and DaCapo benchmarks show that we could optimize the Hot Spot JVM with significant speedup, 16 SPECjvm2008 startup programs were improved by an average of 19% with three of them improved dramatically by 63%, 51% and 32% within a maximum tuning time of 200 minutes for each. Based on a minimum tuning time of 200 minutes, average performance improvement for 13 DaCapo benchmark programs is 26% with 42% being the maximum improvement.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126519296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In Fall 2013, we began participation in the NSF/IEEE TCPP 2013 Early Adopters Program. This paper presents our efforts to incorporate parallel and distributed computing topics into our undergraduate computer science and engineering curriculum, guided by the IEEE-TCPP model curriculum. So far, the TCPP-recommended curriculum has been integrated into eight courses, and evaluations show that our integration effort has been successful. Evaluation also shows that practices such as lab and homework assignments effectively improve students' grasp of the concepts, and that the contest club is a necessary complement to the in-class courses.
{"title":"Integrating Parallel and Distributed Computing Topics into an Undergraduate CS Curriculum at UESTC","authors":"Guoming Lu, Jie Xu, Jieyan Liu, Bo Dai, Shenglin Gui, Siyu Zhan","doi":"10.1109/IPDPSW.2015.66","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.66","url":null,"abstract":"In Fall 2013, we began participation in NSF/IEEE TCPP 2013 Early Adopters Program. This paper presents our efforts to incorporate parallel and distributed computing topics into our undergraduate computer science and engineering curriculum with the guide of the IEEE-TCPP model Curriculum. So far, TCPP recommended curriculum has been integrated eight courses, evaluations show our integration effort is successful. Evaluation also shows that practices such as lab/homework assignment effective improve students' conception, and contest club is a necessary complementarity of in-class courses.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122485595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We have conducted a performance evaluation of a dual-rail Fourteen Data Rate (FDR) InfiniBand (IB) connected cluster, where each node has two Intel Xeon E5-2670 (Sandy Bridge) processors and two Intel Xeon Phi coprocessors. The Xeon Phi, based on the Many Integrated Core (MIC) architecture, is of the Knights Corner (KNC) generation. We used several types of benchmarks for the study. We ran the MPI and multi-zone versions of the NAS Parallel Benchmarks (NPB) -- both original and optimized for the Xeon Phi. Among the full-scale benchmarks, we ran two versions of WRF, including one optimized for the MIC, and used a 12 km Continental U.S. (CONUS) data set. We also used original and optimized versions of OVERFLOW and ran with four different datasets to understand scaling in symmetric mode and related load-balancing issues. We present performance for the four different modes of using the host + MIC combination: native host, native MIC, offload, and symmetric. We also discuss the various optimization techniques used in optimizing two of the NPBs for offload mode, as well as WRF and OVERFLOW. WRF 3.4 optimized for MIC runs 47% faster than the original NCAR WRF 3.4. The optimized version of OVERFLOW runs 18% faster on the host, and the load-balancing strategy used improves the performance on MIC by 5% to 36% depending on the data size. In addition, we discuss the issues related to offload mode and load balancing in symmetric mode.
{"title":"Early Multi-node Performance Evaluation of a Knights Corner (KNC) Based NASA Supercomputer","authors":"S. Saini, Haoqiang Jin, D. Jespersen, Samson Cheung, M. J. Djomehri, Johnny Chang, R. Hood","doi":"10.1109/IPDPSW.2015.140","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.140","url":null,"abstract":"We have conducted performance evaluation of a dual-rail Fourteen Data Rate (FDR) InfiniBand (IB) connected cluster, where each node has two Intel Xeon E5-2670 (Sandy Bridge) processors and two Intel Xeon Phi coprocessors. The Xeon Phi, based on the Many Integrated Core (MIC) architecture, is of the Knights Corner (KNC) generation. We used several types of benchmarks for the study. We ran the MPI and multi-zone versions of the NAS Parallel Benchmarks (NPB) -- both original and optimized for the Xeon Phi. Among the full-scale benchmarks, we ran two versions of WRF, including one optimized for the MIC, and used a 12 Km Continental U.S (CONUS) data set. We also used original and optimized versions of OVERFLOW and ran with four different datasets to understand scaling in symmetric mode and related load-balancing issues. We present performance for the four different modes of using the host + MIC combination: native host, native MIC, offload, and symmetric. We also discuss the various optimization techniques used in optimizing two of the NPBs for offload mode as well as WRF and OVERFLOW. WRF 3.4 optimized for MIC runs 47% faster than the original NCAR WRF 3.4. The optimized version of OVERFLOW runs 18% faster on the host and the load-balancing strategy used improves the performance on MIC by 5% to 36% depending on the data size. In addition, we discuss the issues related to offload mode and load balancing in symmetric mode.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122754939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we deal with the cloud brokering problem in the context of a multi-cloud infrastructure. The problem is by nature a multi-criteria optimization problem. The focus is put mainly (but not only) on the security/trust criterion, which is rarely considered in the literature. We use the well-known Promethee method to solve the problem, which is original in the context of cloud brokering. In other words, if we give a high priority to the secure deployment of a service, are we still able to satisfy all of the other required QoS constraints? Reciprocally, if we give a high priority to the RTT (Round-Trip Time) constraint to access the Cloud, are we still able to ensure a weak/medium/strong 'security level'? We decided to stay at a high level of abstraction for the problem formulation and to conduct experiments using 'real' data. We believe that the design of the solution and the simulation tool we introduce in the paper are practical, thanks to the Promethee approach, which has been used for more than 25 years but never, to our knowledge, for solving Cloud optimization problems. We expect that this study will be a first step towards better understanding, in the future, potential constraints in terms of control over external cloud services, in order to implement them in a simple manner. The contributions of the paper are the modeling of an optimization problem with security constraints, the solving of the problem with the Promethee method, and an experimental study that plays with multiple constraints to measure the impact of each constraint on the solution. During this process, we also provide a sensitivity analysis of the Consensus Assessments Initiative Questionnaire by the Cloud Security Alliance (CSA). The analysis deals with the variety, balance, and disparity of the questionnaire answers.
{"title":"The Promethee Method for Cloud Brokering with Trust and Assurance Criteria","authors":"C. Toinard, Timothee Ravier, C. Cérin, Yanik Ngoko","doi":"10.1109/IPDPSW.2015.63","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.63","url":null,"abstract":"In this paper we deal with the cloud brokering problem in the context of a multi-cloud infrastructure. The problem is by nature a multi-criterion optimization problem. The focus is put mainly (but not only) on the security/trust criterion which is rarely considered in the litterature. We use the well known Promethee method to solve the problem which is original in the context of cloud brokering. In other words, if we give a high priority to the secure deployment of a service, are we still able to satisfy all of the others required QoS constraints? Reciprocally, if we give a high priority to the RTT (Round-Trip Time) constraint to access the Cloud, are we still able to ensure a weak/medium/strong 'security level'? We decided to stay at a high level of abstraction for the problem formulation and to conduct experiments using 'real' data. We believe that the design of the solution and the simulation tool we introduce in the paper are practical, thanks to the Promethee approach that has been used for more than 25 years but never, to our knowledge, for solving Cloud optimization problems. We expect that this study will be a first step to better understand, in the future, potential constraints in terms of control over external cloud services in order to implement them in a simple manner. The contributions of the paper are the modeling of an optimization problem with security constraints, the problem solving with the Promethee method and an experimental study aiming to play with multiple constraints to measure the impact of each constraint on the solution. During this process, we also provide a sensitive analysis of the Consensus Assessments Initiative Questionnaire by the Cloud Security Alliance (CSA). The analysis deals with the variety, balance and disparity of the questionnaire answers.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124642362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power is increasingly the limiting factor in High Performance Computing (HPC). Growing core counts in each generation increase power and energy demands. In the future, strict power and energy budgets will be used to control the operating costs of supercomputer centers. Every node needs to use energy wisely. Energy efficiency can be improved either by taking less time or by running at lower power. In this paper, we use Dynamic Duty Cycle Modulation (DDCM) to improve energy efficiency by improving performance under a power bound. When the power is not capped, DDCM reduces processor power, saving energy and reducing processor temperature. DDCM allows the clock frequency to be controlled for each individual core with very low overhead. In any situation where the individual threads on a processor exhibit imbalance, a more balanced execution can be obtained by slowing the "fast" threads. We use the time between MPI collectives and the waiting time at each collective to determine a thread's "near optimal" frequency. All changes are within the MPI library, introducing no user code changes or additional communication/synchronization. To test DDCM, a set of synthetic MPI programs with load imbalance was created. In addition, two HPC MPI benchmarks with load imbalance were examined. In our experiments, DDCM saves up to 13.5% processor energy on one node and 20.8% on 16 nodes. By applying a power cap, DDCM effectively shifts power consumption between cores and improves overall performance. Performance improvements of 6.0% and 5.6% on one and 16 nodes, respectively, were observed.
{"title":"Using Dynamic Duty Cycle Modulation to Improve Energy Efficiency in High Performance Computing","authors":"Sridutt Bhalachandra, Allan Porterfield, J. Prins","doi":"10.1109/IPDPSW.2015.144","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.144","url":null,"abstract":"Power is increasingly the limiting factor in High Performance Computing (HPC). Growing core counts in each generation increase power and energy demands. In the future, strict power and energy budgets will be used to control the operating costs of supercomputer centers. Every node needs to use energy wisely. Energy efficiency can either be improved by taking less time or running at lower power. In this paper, we use Dynamic Duty Cycle Modulation (DDCM) to improve energy efficiency by improving performance under a power bound. When the power is not capped, DDCM reduces processor power, saving energy and reducing processor temperature. DDCM allows the clock frequency to be controlled for each individual core with very low overhead. Any situation where the individual threads on a processor are exhibiting imbalance, a more balanced execution can be obtained by slowing the \"fast\" threads. We use time between MPI collectives and the waiting time at the collective to determine a thread's \"near optimal\" frequency. All changes are within the MPI library, introducing no user code changes or additional communication/synchronization. To test DDCM, a set of synthetic MPI programs with load imbalance were created. In addition, a couple of HPC MPI benchmarks with load imbalance were examined. In our experiments, DDCM saves up to 13.5% processor energy on one node and 20.8% on 16 nodes. By applying a power cap, DDCM effectively shifts power consumption between cores and improves overall performance. Performance improvements of 6.0% and 5.6% on one and 16 nodes, respectively, were observed.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124687879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The dependability of a networked software system S is a measure of how well S meets its service-level objectives in the presence of uncontrolled external environment conditions incident on S. The dependability attribute is hard to determine due to the inherent complexity of S: how the behavior of S is affected by the external environment conditions and by network resource availability is difficult to capture accurately with mathematical models. The complexity arises from the large dimensionality of the input parameter space and the interactions between the various components in S. This leads to the employment of model-based control techniques that adapt the operations of S over multiple "observe-actuate" rounds and steer S towards a reference input QoS. How close the actual QoS so achieved is to the reference QoS is a measure of the dependability of S. Our paper studies model-based engineering methods to quantify the notion of dependability, and thereby enhance the dependability of S. The paper provides a management architecture to dynamically evaluate the dependability of S and to adjust the plant and algorithm parameters to control the dependability.
{"title":"Dependability Modeling and Assessment of Complex Adaptive Networked Systems","authors":"K. Ravindran","doi":"10.1109/IPDPSW.2015.142","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.142","url":null,"abstract":"The dependability of a networked software system S is a measure of how good S meets its service-level objectives in the presence of uncontrolled external environment conditions incident on S. The dependability attribute is hard to determine due to the inherent complexity of S, i.e., how the behavior of S is affected by the external environment conditions and the network resource availability is difficult to be accurately captured with mathematical models. The complexity arises from the large dimensionality of input parameter space and the interactions between various components in S. This leads to the employment of model-based control techniques that adapt the operations of S over multiple \"observe-actuate\" rounds and steer S towards a reference input QoS. How close is the actual QoS so achieved to the reference QoS is a measure of the dependability of S. Our paper studies model-based engineering methods to quantify the notion of dependability, and therein enhance the dependability of S. The paper provides a management architecture to dynamically evaluate the dependability of S, and adjust the plant and algorithm parameters to control the dependability.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123322742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we show how to analytically model two widely used distributed matrix-multiply algorithms, Cannon's 2D and Johnson's 3D, implemented within the Intel Concurrent Collections framework for shared/distributed-memory execution. Our precise analytical model proceeds by estimating the computation and communication times, taking into account factors such as the block size, communication bandwidth, and the processor's peak performance. It then applies a roofline-based approach to determine the running time from an estimate of the communication/computation bottleneck. Our models are validated by comparing the estimates to the measured run times while varying the problem size and work distribution, showing only marginal differences. We conclude by using our model to perform a predictive analysis of the impact of improving the computation speed by a factor of 4×.
{"title":"A Roofline-Based Performance Estimator for Distributed Matrix-Multiply on Intel CnC","authors":"Martin Kong, L. Pouchet, P. Sadayappan","doi":"10.1109/IPDPSW.2015.134","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.134","url":null,"abstract":"In this paper we show how to analytically model two widely used distributed matrix-multiply algorithms, Cannon's 2D and Johnson's 3D, implemented within the Intel Concurrent Collections framework for shared/distributed memory execution. Our precise analytical model proceeds by estimating the computation time and communication times, taking into account factors such as the block size, communication bandwidth, processor's peak performance, etc. It then applies a roofline-based approach to determine the running time based on communication/computation bottleneck estimation. Our models are validated by comparing the estimations to the measured run times varying the problem size and work distribution, showing only marginal differences. We conclude by using our model to perform a predictive analysis on the impact of improving the computation speed by a factor of 4×.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115128783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary form only given. Heterogeneous designs are becoming ubiquitous across many new system architectures as architects turn to accelerators to deliver increased system performance and capability. In order to realize the potential of these heterogeneous designs, a framework is needed to support applications in taking advantage of system capabilities. A suitable framework will depend on a combination of hardware primitives and software programming models. Together, hardware and software for heterogeneous multicore designs have to address the four "P"s of modern system design: Productivity, Portability, Performance, and Partitioning. To support current software deployment models, hardware primitives and programming models must ensure application isolation for partitioned virtual machine environments and application portability across a broad range of heterogeneous system design offerings. Hardware primitives and programming models must enable data center architects to build systems that continue scaling up performance, and application developers to develop their applications with the necessary productivity. The Coherent Accelerator Processor Interface (CAPI) provides the basis for such a framework as the integration point of accelerators into the POWER system architecture. CAPI accelerators can access application data directly using an integrated MMU. The CAPI MMU also provides partition isolation. Enabling accelerators to manage and pace their data access simplifies programming and prevents the CPUs from becoming serial bottlenecks. Finally, CAPI provides pointer identity, i.e., it enables the same address to be used in both the CPU and the accelerator to retrieve the same objects from memory. Pointer identity lays the foundation for high-performance and high-productivity accelerator programming models.
{"title":"PLC Keynote","authors":"M. Gschwind","doi":"10.1109/IPDPSW.2015.176","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.176","url":null,"abstract":"Summary form only given. Heterogeneous designs are becoming ubiquitous across many new systemarchitectures as architects are turning to accelerators to deliver increased system performance and capability. In order to realize the potential of these heterogeneous designs, a framework is needed to support applications to take advantage of system capabilities. A suitable framework will depend on a combination of hardware primitives and software programming models. Together, hardware and software for heterogeneous multicore designs have to address the four \"P\"s of modern system design: Productivity, Portability, Performance, and Partitioning. To support current software deployment models, hardware primitives and programming models must ensure application isolation for partitioned virtual machine environments and application portability across a broad range of heterogeneous system design offerings. Hardware primitives and programming models must enable data center architects to build systems that continue scaling up performance and application developers to develop their applications with the necessary productivity. The Coherent Accelerator Processor Interface (CAPI) provides the basis for such a framework as integration point of accelerators into the POWER system architecture. CAPI accelerators can access application data directly using an integrated MMU. The CAPI MMU also provides partition isolation. Enabling accelerators to manage and pace their data access simplifies programming and prevents the CPUs from becoming serial bottlenecks. Finally, CAPI provides pointer identify, i.e., it enables the same address to be used in both CPU and accelerator to retrieve the same objects from memory. Pointer identity lays the foundation for high performance and high productivity accelerator programming models.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133333421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in three-dimensional integrated circuits have enabled vertical stacks of memory to be integrated with an FPGA layer. Such architectures enable high-bandwidth and low-latency access to memory, which is beneficial for memory-intensive applications. We build a performance model of a representative 3D memory integrated FPGA architecture for matrix multiplication. We derive the peak performance of the algorithm on this model in terms of throughput and energy efficiency. We evaluate the effect of different architecture parameters on performance and identify the critical bottlenecks. The parameters include the configuration of memory layers, vaults, and Through-Silicon Vias (TSVs). Our analysis indicates that memory is one of the major consumers of energy on such an architecture. We model memory activation scheduling on vaults for this application and show that it improves energy efficiency by 1.83× while maintaining a throughput of 200 GOPS. The 3D memory integrated FPGA model achieves a peak performance of 93 GOPS/J for a matrix of size 16K×16K. We also compare the peak performance of a 2D architecture with that of the 3D architecture and observe a marginal improvement in both throughput and energy efficiency. Our analysis indicates that the bottleneck is the FPGA, which dominates the total computation time and energy consumption. In addition to matrix multiplication, which requires O(m³) computation work, we also analyzed the class of applications that require O(m²) work. In particular, for matrix transposition we found that the improvement is on the order of 3× in energy consumption and 7× in runtime. This indicates that the computation cost of the application must match the memory access time in order to exploit the large bandwidth of 3D memory.
{"title":"Performance Modeling of Matrix Multiplication on 3D Memory Integrated FPGA","authors":"Shreyas G. Singapura, A. Panangadan, V. Prasanna","doi":"10.1109/IPDPSW.2015.133","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.133","url":null,"abstract":"Recent advances in three dimensional integrated circuits have enabled vertical stacks of memory to be integrated with an FPGA layer. Such architectures enable high bandwidth and low latency access to memory which is beneficial for memory-intensive applications. We build a performance model of a representative 3D Memory Integrated FPGA architecture for matrix multiplication. We derive the peak performance of the algorithm on this model in terms of throughput and energy efficiency. We evaluate the effect of different architecture parameters on performance and identify the critical bottlenecks. The parameters include the configuration of memory layers, vaults, and Through Silicon Vias (TSVs). Our analysis indicates that memory is one of the major consumers of energy on such an architecture. We model memory activation scheduling on vaults for this application and show that it improves energy efficiency by 1.83× while maintaining a throughput of 200 GOPS/s. The 3D Memory Integrated FPGA model achieves a peak performance of 93 GOPS/J for a matrix of size 16K×16K. We also compare the peak performance of a 2D architecture with that of the 3D architecture and observe a marginal improvement in both throughput and energy efficiency. Our analysis indicates that the bottleneck is the FPGA which dominates the total computation time and energy consumption. In addition to matrix multiplication, which requires O (m3) amount of computation work to be done, we also analyzed the class of applications which require O (m2) work. In particular, for matrix transposition we found out that the improvement is of the order 3× for energy consumption and 7× in runtime. This indicates that the computation cost of the application must match the memory access time in order to exploit the large bandwidth of 3D memory.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132327401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous architectures, with their diverse architectural features, impose significant programmability challenges. Existing programming systems involve non-trivial learning, are not productive or portable, and are challenging to tune for performance. In this paper, we introduce Heterogeneous Habanero-C (H2C), which is an implementation of the Habanero execution model for modern heterogeneous (CPU + GPU) architectures. The H2C language provides high-level constructs to specify the computation, communication, and synchronization in a given application. H2C also implements novel constructs for task partitioning and locality. The H2C (source-to-source) compiler and runtime framework efficiently map these high-level constructs onto the underlying heterogeneous platform, which can include multiple CPU cores and multiple GPU devices, possibly from different vendors. Experimental evaluations of four applications show significant improvements in productivity, portability, and performance.
{"title":"Heterogeneous Habanero-C (H2C): A Portable Programming Model for Heterogeneous Processors","authors":"Deepak Majeti, Vivek Sarkar","doi":"10.1109/IPDPSW.2015.81","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.81","url":null,"abstract":"Heterogeneous architectures with their diverse architectural features impose significant programmability challenges. Existing programming systems involve non-trivial learning and are not productive, not portable, and are challenging to tune for performance. In this paper, we introduce Heterogeneous Habanero-C (H2C), which is an implementation of the Habanero execution model for modern heterogeneous (CPU + GPU) architectures. The H2C language provides high-level constructs to specify the computation, communication, and synchronization in a given application. H2C also implements novel constructs for task partitioning and locality. The H2C (source-to-source) compiler and runtime framework efficiently map these high-level constructs onto the underlying heterogeneous platform, which can include multiple CPU cores and multiple GPU devices, possibly from different vendors. Experimental evaluations of four applications show significant improvements in productivity, portability, and performance.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"153 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116384732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}