We present a new, integrated approach to parallel performance analysis that combines traditional application-oriented performance data with measurements of the physical runtime environment. We have developed the infrastructure needed for combined evaluation of system, application, and machine-room performance in high-end environments. We illustrate the utility of our approach with data from our study of the power and cooling impact of an application's physical location within the machine room. We demonstrate the integration of measured performance data from the application, the system, and the physical room environment, and discuss the challenges encountered.
{"title":"Integrating Power and Cooling into Parallel Performance Analysis","authors":"R. Knapp, K. Karavanic, A. Márquez","doi":"10.1109/ICPPW.2010.72","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.72","url":null,"abstract":"We present a new, integrated approach to parallel performance analysis that integrates traditional application-oriented performance data with measurements of the physical runtime environment. We have developed the needed infrastructure for combined evaluation of system, application, and machine room performance in the high end environment. We illustrate the utility of our approach, with data from our study of the power and cooling impact of the choice of physical location for an application within the machine room. We demonstrate the integration of measured performance data from the application, system, and physical room environment, and discuss the challenges encountered.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133472143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chunye Gong, Jie Liu, Qiang Zhang, Haitao Chen, Z. Gong
Cloud computing has emerged as one of the hottest topics in information technology. It builds on several other computing research areas, such as HPC, virtualization, utility computing, and grid computing. To clarify the essence of cloud computing, we propose the characteristics that make cloud computing what it is and that distinguish it from other research areas. Cloud computing has its own conceptual, technical, economic, and user-experience characteristics: service orientation, loose coupling, strong fault tolerance, a distinctive business model, and ease of use are its main traits. Clear insight into cloud computing will help both academia and industry in the development and adoption of this evolving technology.
{"title":"The Characteristics of Cloud Computing","authors":"Chunye Gong, Jie Liu, Qiang Zhang, Haitao Chen, Z. Gong","doi":"10.1109/ICPPW.2010.45","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.45","url":null,"abstract":"Cloud computing emerges as one of the hottest topic in field of information technology. Cloud computing is based on several other computing research areas such as HPC, virtualization, utility computing and grid computing. In order to make clear the essential of cloud computing, we propose the characteristics of this area which make cloud computing being cloud computing and distinguish it from other research areas. The cloud computing has its own conceptional, technical, economic and user experience characteristics. The service oriented, loose coupling, strong fault tolerant, business model and ease use are main characteristics of cloud computing. Clear insights into cloud computing will help the development and adoption of this evolving technology both for academe and industry.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117236212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing programs for a Graphics Processing Unit (GPU) requires thorough knowledge of the architectural features of the computing platform. However, this knowledge is frequently unavailable, e.g., due to insufficient documentation, probably a result of the infancy of general-purpose computing on the GPU. What makes modeling program performance on the GPU even more difficult is that the exact value of some "architectural" parameters depends on how a GPU program interacts with those features. For example, AMD GPUs show different memory latencies when memory is accessed with address sequences that have different patterns. Current micro-benchmark suites such as X-Ray are powerless to characterize the GPU. Clearly, a prerequisite for efficient code optimization and automatic tuning on the GPU is a systematic method to measure the architectural features and identify the most basic program characteristics that determine program performance on the new GPU architectures. In this paper, we present a micro-benchmark suite for AMD GPUs that supports the AMD StreamSDK. Our model identifies and measures a series of architectural features and basic program characteristics that are most important and most predictive for program performance on the platform. These include vectorization, burst write latency, texture fetch latency, global read and write latency, ALU/fetch operation ratio, domain size, and register usage, for both AMD's pixel-shader and compute-shader modes. Our performance model not only generates correct values for those parameters, but also provides a clear picture of program performance on the GPU.
{"title":"A Micro-benchmark Suite for AMD GPUs","authors":"Ryan Taylor, Xiaoming Li","doi":"10.1109/ICPPW.2010.59","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.59","url":null,"abstract":"Optimizing programs for Graphic Processing Unit (GPU) requires thorough knowledge about the values of architectural features for the new computing platform. However, this knowledge is frequently unavailable, e.g., due to insufficient documentation, which is probably a result of the infancy of general purpose computing on the GPU. What makes the modeling of program performance on GPU even more difficult is that the exact value of some “architectural” parameters on the GPU depends on how a GPU program interacts with those features. For example, AMD GPUs show different memory latencies when the memory is accessed with address sequences that have different patterns. Current micro-benchmark suites such as X-Ray are powerless for characterizing the GPU. Clearly, a preliminary for efficient code optimization and automatic tuning on the GPU is a systematic method to measure the architectural features and identify the most basic program characteristics that determine the performance of a program on the new GPU architectures. In this paper, we present a micro-benchmark suite for AMD GPUs that supports the AMD StreamSDK. Our model identifies and measures a series of architectural features and basic program characteristics that are most important and most predictive for program performance on the platform. The features and characteristics include vectorization, burst write latency, texture fetch latency, global read and write latency, ALU/Fetch operation ratio, domain size and register usage for both AMD’s pixel shader and compute shader modes. Our performance model not only generates correct values for those parameters, but also provides a clear picture of program performance on the GPU.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127236866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
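The latency-measurement principle behind such a micro-benchmark can be illustrated with a pointer-chasing sketch. The snippet below is a hypothetical, CPU-side simplification (not from the paper): each access depends on the previous one, so per-step time approximates access latency for a chosen stride pattern, the same idea the suite applies to GPU memory paths.

```python
import time

def make_chain(n, stride):
    """Build a cyclic pointer chain over n slots with the given stride.
    stride must be coprime with n so the walk visits every slot.
    Each access depends on the previous one, defeating overlap/prefetch."""
    order = [(i * stride) % n for i in range(n)]
    nxt = [0] * n
    for i in range(n):
        nxt[order[i]] = order[(i + 1) % n]
    return nxt

def chase(nxt, steps):
    """Traverse the chain; elapsed time per step approximates latency."""
    i = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = nxt[i]
    return (time.perf_counter() - t0) / steps
```

Varying `stride` (and the table size `n`) while measuring `chase` is how address-pattern-dependent latency, like that reported for AMD GPUs, would be exposed.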
Chao-Min Su, Jiader Chou, Chih-Wei Yi, Y. Tseng, Chia-Hung Tsai
Positioning is the key technique for developing geographic applications such as location-based services. The Global Positioning System (GPS) is a common approach to positioning in vehicular navigation. Although GPS provides absolute position information, its accuracy is not sufficient for personal navigation; worse, GPS does not work well indoors. Inertial Measurement Units (IMUs), in contrast, can track objects with high precision but provide only relative position information. Integrating GPS and an IMU therefore enables positioning both indoors and outdoors. In this paper, we combine our previous work, a pedestrian tracking system for handheld devices, with GPS to build a personal navigation system for handheld devices. The system computes both position and heading information, and it also serves as a platform for many location-related applications.
{"title":"Sensor-Aided Personal Navigation Systems for Handheld Devices","authors":"Chao-Min Su, Jiader Chou, Chih-Wei Yi, Y. Tseng, Chia-Hung Tsai","doi":"10.1109/ICPPW.2010.78","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.78","url":null,"abstract":"The positioning technique is the key technique for developing geographic applications, like location based services. The Global Positioning System (GPS) is a common approach for positioning in vehicular navigations. Although GPS can provide absolute position information, the accuracy of GPS is not enough for personal navigations. What is worse, GPS does not work well indoors. Instead, Inertial Measurement Units (IMUs) can be used to track objects with high precision, but it provides relative position information. Thus, integration of GPS and IMU can do positioning indoors and outdoors. In this paper, combining our previous work, a pedestrian tracking system for handheld devices, with GPS leads to a personal navigation system for handheld devices. The position and heading information can be calculated from this system. The system also serves a platform for many applications related to the location.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130735248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
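The GPS/IMU integration idea can be sketched as a minimal 1-D complementary filter: integrate IMU velocity for smooth short-term (relative) motion, and blend in absolute GPS fixes when available. The class below is an illustrative assumption, not the authors' system; real pedestrian tracking fuses full 3-D inertial data and heading.

```python
class DeadReckoningFilter:
    """1-D sketch: dead-reckon from IMU velocity, correct with GPS fixes."""
    def __init__(self, alpha=0.9):
        self.alpha = alpha   # weight on the IMU (short-term) estimate
        self.pos = 0.0

    def step(self, imu_velocity, dt, gps_pos=None):
        self.pos += imu_velocity * dt            # relative (IMU) update
        if gps_pos is not None:                  # absolute (GPS) correction
            self.pos = self.alpha * self.pos + (1 - self.alpha) * gps_pos
        return self.pos
```

Indoors (no GPS fix) the filter runs on IMU data alone; outdoors, each fix pulls the drifting dead-reckoned estimate back toward the absolute position.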
Che-Yuan Tu, Wen-Chieh Kuo, Wei-Hua Teng, Yao-Tsung Wang, Steven Shiau
With growing interest in cloud computing and carbon-emission reduction, building an energy-efficient cloud architecture has become a critical issue for service providers. In this paper, we propose a power-aware cloud architecture based on DRBL (Diskless Remote Boot in Linux), cpufreqd, and xenpm. We also introduce a low-cost smart metering system based on the open-hardware Arduino board. Combined with existing techniques such as Dynamic Voltage and Frequency Scaling (DVFS), ACPI, diskless design, and RAM-disk storage, our experimental results show that this architecture reduces energy consumption by 4 to 11% when running CPU-intensive applications. In conclusion, this paper shows that service providers can benefit from diskless design and RAM-disk storage if their applications are CPU-intensive.
{"title":"A Power-Aware Cloud Architecture with Smart Metering","authors":"Che-Yuan Tu, Wen-Chieh Kuo, Wei-Hua Teng, Yao-Tsung Wang, Steven Shiau","doi":"10.1109/ICPPW.2010.73","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.73","url":null,"abstract":"With the growing interest of cloud computing and carbon emission reduction, how to build energy efficient cloud architecture becomes a crisis issue for service providers. In this paper, we propose a power-aware cloud architecture based on DRBL (Diskless Remote Boot in Linux), cpufreqd and xenpm. We also introduce a low-cost smart metering system based on open hardware Arduino board. Composing with existing technique, such as Dynamic Voltage Frequency Scaling (DVFS), ACPI, diskless design and RAM disk storage, our experiment results show that this architecture will reduce energy consumption from 4 to 11% when running CPU-intensive applications. In conclusion, this paper reveals that service providers could benefit from diskless design and RAM disk storage if their applications are CPU-intensive.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115956062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
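Why DVFS saves energy follows from the standard dynamic-power relation for CMOS, P ≈ C·V²·f: lowering frequency allows a lower voltage, and power falls with the square of the voltage. The helper below is a hedged sketch of that arithmetic; the example voltage/frequency pairs are made up, not the paper's measurements.

```python
def dynamic_power(c_eff, voltage, freq):
    """Dynamic CMOS power: P = C_eff * V^2 * f (C_eff: effective capacitance)."""
    return c_eff * voltage ** 2 * freq

def dvfs_savings(v_hi, f_hi, v_lo, f_lo, c_eff=1.0):
    """Fraction of dynamic power saved by scaling from (v_hi, f_hi) down
    to (v_lo, f_lo). C_eff cancels out of the ratio."""
    p_hi = dynamic_power(c_eff, v_hi, f_hi)
    p_lo = dynamic_power(c_eff, v_lo, f_lo)
    return 1.0 - p_lo / p_hi
```

For instance, a hypothetical scaling from 1.2 V / 2.0 GHz to 1.0 V / 1.5 GHz cuts dynamic power by roughly 48%, which is why frequency governors like cpufreqd can yield measurable savings even on CPU-intensive workloads.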
Mangesh K. Kunchamwar, Durga P. Prasad, Pawan Hegde, P. Balsara, R. Sangireddy
There is an increasing demand for converged multi-standard radio processors that support existing and future standards. In this work, a heterogeneous multi-processor platform is proposed for multi-standard wireless communication systems that is programmable and scalable in adapting to future standards. Channel decoding algorithms are an important constituent of a wireless communication system because of their computational complexity. A programmable radio processor with application-specific instruction accelerators is proposed for channel decoding. The Viterbi and Turbo channel decoding algorithms are analyzed for computational parallelism and for hardware reusability across the algorithms, and the application-specific instruction accelerator is designed by exploiting their shared characteristics and parallelism. The analysis shows that the proposed design achieves a throughput of 54 Mbps for a UWB Viterbi decoder and 12 Mbps for a UMTS Turbo decoder at 91.7 MHz.
{"title":"Application Specific Instruction Accelerator for Multistandard Viterbi and Turbo Decoding","authors":"Mangesh K. Kunchamwar, Durga P. Prasad, Pawan Hegde, P. Balsara, R. Sangireddy","doi":"10.1109/ICPPW.2010.17","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.17","url":null,"abstract":"There is an increasing demand for converged solution for multi-standard radio processors to support existing and future standards. In this work, heterogeneous multi-processor platform is proposed for multi standard wireless communication system which is programmable and scalable in adapting to future standards. Channel decoding algorithms form important constituent of wireless communication system because of their computational complexity. A programmable radio processor is proposed for channel decoding with application specific instruction accelerators. Viterbi and Turbo channel decoding algorithms are analyzed for computational parallelism in the algorithms and for hardware reusability across the algorithms. Application specific instruction accelerator is designed by exploiting similar characteristics and computational parallelism across the algorithms. The analysis shows that the throughput of 54Mbps for UWB Viterbi Decoder and 12 Mbps for UMTS Turbo Decoder at 91.7MHz can be achieved using the proposed design.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134175889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
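Hard-decision Viterbi decoding, the algorithm such accelerators speed up, can be sketched in a few lines. The rate-1/2, constraint-length-3 convolutional code with generators (7, 5) octal below is a textbook example chosen for brevity, not necessarily the UWB/UMTS configurations targeted in the paper; the add-compare-select loop over trellis states is the part that maps onto parallel hardware.

```python
G = [0b111, 0b101]  # generator polynomials (7, 5) octal, K = 3

def conv_encode(bits):
    """Rate-1/2 convolutional encoder; state holds the two previous bits."""
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state                 # [b, b-1, b-2]
        for g in G:
            out.append(bin(reg & g).count("1") & 1)
        state = reg >> 1
    return out

def viterbi_decode(received):
    """Hard-decision Viterbi: per symbol, add branch metrics, compare the
    two paths entering each state, select the survivor (ACS)."""
    INF = float("inf")
    metric = [0, INF, INF, INF]                # start in state 0
    paths = [[] for _ in range(4)]
    for t in range(0, len(received), 2):
        r = received[t:t + 2]
        new_metric = [INF] * 4
        new_paths = [None] * 4
        for s in range(4):
            if metric[s] == INF:
                continue
            for b in (0, 1):                   # hypothesize input bit
                reg = (b << 2) | s
                expect = [bin(reg & g).count("1") & 1 for g in G]
                cost = metric[s] + sum(x != y for x, y in zip(r, expect))
                ns = reg >> 1                  # next state
                if cost < new_metric[ns]:      # compare-select
                    new_metric[ns] = cost
                    new_paths[ns] = paths[s] + [b]
        metric, paths = new_metric, new_paths
    return paths[min(range(4), key=lambda s: metric[s])]
```

With free distance 5, this code corrects isolated bit errors; appending two zero flush bits returns the encoder to state 0 before decoding.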
This paper presents an evaluation of radix-2, radix-4, and radix-8 algorithms for N-point FFTs on a homogeneous Multi-Processor System-on-Chip prototyped on an FPGA device. The algorithms were evaluated by profiling them against a single-processor architecture, in terms of required clock cycles, achieved speed-up, and parallelization efficiency. The analysis shows, for each algorithm, how parallelization efficiency grows when moving from small to larger FFTs. Moreover, the comparison between the different implementations reveals the parallelization properties of each algorithm: radix-2 shows the best speed-up and parallelization efficiency, while radix-4 gives the best performance in terms of required clock cycles.
{"title":"FFT Algorithms Evaluation on a Homogeneous Multi-processor System-on-Chip","authors":"Roberto Airoldi, F. Garzia, J. Nurmi","doi":"10.1109/ICPPW.2010.20","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.20","url":null,"abstract":"This paper presents the evaluation of radix-2, radix-4 and radix-8 algorithms for N-point FFTs on a homogeneous Multi-Processor System-on-Chip, prototyped on FPGA device. The evaluation of the algorithms was done analysing profiling of the algorithms in comparison to a single processor architecture. The performance were evaluated in terms of required clock cycles, achieved speed-up and parallelization efficiency. The analysis showed for each algorithm how the parallelization efficiency grows moving from small to larger FFTs. Moreover the comparison between the different implementations showed the parallelization properties of each algorithm. Radix-2 algorithm shows the best speed-up and parallelization efficiency while radix-4 gives the best performance in terms of required clock cycles.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115703798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Penczek, S. Herhut, S. Scholz, A. Shafarenko, Jungsook Yang, Chun-Yi Chen, N. Bagherzadeh, C. Grelck
The design and implementation of the coordination language S-NET has been reported previously. In this paper we apply the S-NET design methodology to a computer graphics problem. We demonstrate (i) how a complete separation of concerns can be achieved between algorithm engineering and concurrency engineering, and (ii) that the S-NET implementation is capable of matching the performance achievable with low-level tools such as MPI. We find this remarkable because, under S-NET, communication, concurrency, and synchronization are completely separated from algorithmic code. We argue that our approach delivers a flexible component technology that liberates application developers from the logistics of task and data management, while making it unnecessary for a distributed-computing professional to acquire detailed knowledge of the application area.
{"title":"Message Driven Programming with S-Net: Methodology and Performance","authors":"F. Penczek, S. Herhut, S. Scholz, A. Shafarenko, Jungsook Yang, Chun-Yi Chen, N. Bagherzadeh, C. Grelck","doi":"10.1109/ICPPW.2010.61","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.61","url":null,"abstract":"Development and implementation of the coordination language S-NET has been reported previously. In this paper we apply the S-NET design methodology to a computer graphics problem. We demonstrate (i) how a complete separation of concerns can be achieved between algorithm engineering and concurrency engineering and (ii) that the S-NET implementation is quite capable of achieving performance that matches what can be achieved using low-level tools such as MPI. We find this remarkable as under S-NET communication, concurrency and synchronization are completely separated from algorithmic code. We argue that our approach delivers a flexible component technology which liberates application developers from the logistics of task and data management while at the same time making it unnecessary for a distributed computing professional to acquire detailed knowledge of the application area.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114851832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel storage systems are highly scalable and widely used to support data-intensive applications. In future systems that must process and store massive data, hybrid storage systems are a natural solution for meeting diverse demands such as large storage capacity, high I/O performance, and low cost. Hybrid storage systems (HSS) contain both high-end storage components (e.g., solid-state disks and hard disk drives) to guarantee performance, and low-end storage components (e.g., tapes) to reduce cost. In an HSS, transferring data back and forth among solid-state disks (SSDs), hard disk drives (HDDs), and tapes plays a critical role in achieving high I/O performance. Prefetching is a promising way to reduce data-transfer latency in an HSS. However, prefetching in the HSS context is technically challenging due to an interesting dilemma: aggressive prefetching is required to reduce I/O latency efficiently, whereas overaggressive prefetching may waste I/O bandwidth by transferring useless data from HDDs to SSDs or from tapes to HDDs. To address this problem, we propose a multi-layer prefetching algorithm that judiciously prefetches data from tapes to HDDs and from HDDs to SSDs. To evaluate the algorithm, we develop an analytical model; the experimental results show that our prefetching algorithm improves performance in hybrid storage systems.
{"title":"Multi-layer Prefetching for Hybrid Storage Systems: Algorithms, Models, and Evaluations","authors":"Mais Nijim, Ziliang Zong, X. Qin, Y. Nijim","doi":"10.1109/ICPPW.2010.18","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.18","url":null,"abstract":"Parallel storage systems have been highly scalable and widely used in support of data-intensive applications. In future systems with the nature of massive data processing and storing, hybrid storage systems opt for a solution to fulfill a variety of demands such as large storage capacity, high I/O performance and low cost. Hybrid storage systems (HSS) contain both high-end storage components (e.g. solid-state disks and hard disk drives) to guarantee performance, and low-end storage components (e.g. tapes) to reduce cost. In HSS, transferring data back and forth among solid-state disks (SSDs), hard disk drives (HDDs), and tapes plays a critical role in achieving high I/O performance. Prefetching is a promising solution to reduce the latency of data transferring in HSS. However, prefetching in the context of HSS is technically challenging due to an interesting dilemma: aggressive prefetching is required to efficiently reduce I/O latency, whereas overaggressive prefetching may waste I/O bandwidth by transferring useless data from HDDs to SSDs or from tapes to HDDs. To address this problem, we propose a multi-layer prefetching algorithm that can judiciously prefetch data from tapes to HDDs and from HDDs to SSDs. To evaluate our algorithm, we develop an analytical model and the experimental results reveal that our prefetching algorithm improves the performance in hybrid storage systems.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122728952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
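One way to picture tiered promotion without being overaggressive is a threshold policy: a block moves up one tier only after it has been accessed enough times to justify the transfer. The toy simulation below is an illustrative assumption, with made-up latency numbers, and is not the authors' algorithm or analytical model.

```python
class MultiLayerStore:
    """Toy three-tier store: blocks start on 'tape' and are promoted
    tape -> hdd -> ssd once their access count crosses a threshold,
    so only demonstrably hot data consumes fast-tier bandwidth."""
    def __init__(self, threshold=2):
        self.tier = {}                          # block -> current tier
        self.hits = {}                          # block -> accesses since promotion
        self.threshold = threshold
        self.latency = {"ssd": 1, "hdd": 10, "tape": 100}  # illustrative units

    def read(self, block):
        tier = self.tier.get(block, "tape")
        self.hits[block] = self.hits.get(block, 0) + 1
        if self.hits[block] >= self.threshold:   # hot enough: promote one level
            if tier == "tape":
                self.tier[block] = "hdd"
            elif tier == "hdd":
                self.tier[block] = "ssd"
            self.hits[block] = 0
        return self.latency[tier]                # cost of this access
```

Repeated reads of the same block see latency fall from tape-level to SSD-level, while blocks touched only once never trigger a wasteful transfer.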
Although MPI is the de facto standard for parallel programming on distributed-memory systems, writing MPI programs is often a time-consuming and complicated process. XcalableMP is a language extension of C and Fortran for parallel programming on distributed-memory systems that helps users reduce this programming effort. XcalableMP provides two programming models. The first is the global-view model, which supports typical parallelization based on the data- and task-parallel paradigms and enables parallelizing the original sequential code with minimal modification using simple, OpenMP-like directives. The second is the local-view model, which allows CAF-like expressions to describe inter-node communication. Users can also use MPI and OpenMP explicitly in our language to optimize performance. In this paper, we introduce XcalableMP, the implementation of its compiler, and performance evaluation results obtained by parallelizing the HPCC Benchmarks in XcalableMP. The results show that users can parallelize code for distributed-memory systems with only small modifications to the original sequential code.
{"title":"Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems","authors":"Jinpil Lee, M. Sato","doi":"10.1109/ICPPW.2010.62","DOIUrl":"https://doi.org/10.1109/ICPPW.2010.62","url":null,"abstract":"Although MPI is a de-facto standard for parallel programming on distributed memory systems, writing MPI programs is often a time-consuming and complicated process. XcalableMP is a language extension of C and Fortran for parallel programming on distributed memory systems that helps users to reduce those programming efforts. XcalableMP provides two programming models. The first one is the global view model, which supports typical parallelization based on the data and task parallel paradigm, and enables parallelizing the original sequential code using minimal modification with simple, OpenMP-like directives. The other one is the local view model, which allows using CAF-like expressions to describe inter-node communication. Users can even use MPI and OpenMP explicitly in our language to optimize performance explicitly. In this paper, we introduce XcalableMP, the implementation of the compiler, and the performance evaluation result. For the performance evaluation, we parallelized HPCC Benchmark in XcalableMP. It shows that users can describe the parallelization for distributed memory system with a small modification to the original sequential code.","PeriodicalId":415472,"journal":{"name":"2010 39th International Conference on Parallel Processing Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129998195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
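The essential work a global-view loop directive does behind the scenes is block-distributing loop iterations over nodes. The index calculation below is an illustrative sketch of that distribution, not XcalableMP's actual runtime; the helper name and its handling of remainders are assumptions.

```python
def block_range(n, rank, nprocs):
    """Indices owned by `rank` when an n-iteration loop is
    block-distributed over nprocs nodes. The first `n % nprocs`
    ranks each receive one extra iteration."""
    base, rem = divmod(n, nprocs)
    lo = rank * base + min(rank, rem)
    hi = lo + base + (1 if rank < rem else 0)
    return range(lo, hi)
```

Under a global-view model, a programmer keeps writing `for i in range(n)` against the whole index space, and the compiler restricts each node to its `block_range`, inserting communication only where remote data is touched.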