Reinforcement learning for automated performance tuning: Initial evaluation for sparse matrix format selection
Warren Armstrong, Alistair P. Rendell
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663802
The field of reinforcement learning has developed techniques for choosing beneficial actions within a dynamic environment. Such techniques learn from experience and do not require explicit teaching. This paper explores how reinforcement learning might be used to determine efficient storage formats for sparse matrices. Three storage formats are considered: coordinate (COO), compressed sparse row (CSR), and blocked compressed sparse row (BCSR). Which format performs best depends heavily on the nature of the matrix and the computer system being used. To test this, a program was written to generate a series of sparse matrices, each of which performs optimally with one of the three storage formats. For each matrix, several sparse matrix-vector products are performed, and the goal of the learning agent is to predict the optimal storage format for that matrix. The proposed agent uses five attributes of the sparse matrix: the number of rows, the number of columns, the number of non-zero elements, the standard deviation of non-zeros per row, and the mean number of neighbours. The agent is characterized by two parameters: an exploration rate and a parameter that determines how the state space is partitioned. The agent's ability to predict the optimal storage format is analyzed on a series of 1,000 automatically generated test matrices.
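To make the format trade-off concrete, two of the three formats the paper compares can be sketched in a few lines. This is an illustrative implementation of COO and CSR matrix-vector products, not the authors' test generator or agent:

```python
# Sketch of two of the three storage formats compared in the paper:
# coordinate (COO) and compressed sparse row (CSR). Which one performs
# best depends on the matrix structure, which is what the agent learns.

def spmv_coo(rows, cols, vals, x, n_rows):
    """y = A @ x with A stored as parallel (row, col, value) lists."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x with A stored in compressed sparse row form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

# The 2x3 matrix [[1, 0, 2], [0, 3, 0]] in both formats:
coo = ([0, 0, 1], [0, 2, 1], [1.0, 2.0, 3.0])
csr = ([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
x = [1.0, 1.0, 1.0]
print(spmv_coo(*coo, x, 2))  # [3.0, 3.0]
print(spmv_csr(*csr, x))     # [3.0, 3.0]
```

CSR replaces COO's per-element row index with one pointer per row, which is why row-oriented access patterns tend to favour it; BCSR goes further by storing small dense blocks.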
A novel hint-based I/O mechanism for centralized file server of cluster
Huan Chen, Jin Xiong, Ninghui Sun
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663771
In small and medium-sized cluster systems, a centralized file server such as NFS is the main approach to providing storage with low cost and easy management. However, when multiple parallel applications access the shared storage at the same time, I/O performance degrades significantly because of interference among the I/O requests coming from different clients. In this paper, a hint-based I/O mechanism is proposed and implemented in United-FS. By analyzing the hint information carried by I/O requests, related requests are grouped, sorted, and scheduled by the hint-based I/O scheduler. Experiments show that the hint-based I/O mechanism nearly doubles read performance compared with NFS and exhibits better scalability.
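The abstract does not spell out the scheduler's policy, so the following is a hypothetical sketch of the grouping-and-sorting idea: requests are grouped by a per-application hint and each group is served in offset order, so the server sees mostly sequential runs instead of interleaved streams. The field names and the round-robin-by-hint order are assumptions, not United-FS internals:

```python
# Hypothetical sketch of hint-based scheduling: group requests by an
# application hint, sort each group by file offset, and serve one
# group's run at a time so client streams do not interleave on disk.
from collections import defaultdict

def schedule(requests):
    """requests: list of dicts with 'hint' (app id), 'offset', 'size'."""
    groups = defaultdict(list)
    for req in requests:
        groups[req["hint"]].append(req)
    order = []
    for hint in sorted(groups):  # serve one application's stream at a time
        order.extend(sorted(groups[hint], key=lambda r: r["offset"]))
    return order

reqs = [
    {"hint": "appB", "offset": 4096, "size": 4096},
    {"hint": "appA", "offset": 8192, "size": 4096},
    {"hint": "appA", "offset": 0,    "size": 4096},
    {"hint": "appB", "offset": 0,    "size": 4096},
]
for r in schedule(reqs):
    print(r["hint"], r["offset"])
# appA's requests come out offset-ordered, then appB's: no interleaving
```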
Message progression in parallel computing - to thread or not to thread?
T. Hoefler, A. Lumsdaine
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663774
Message progression schemes that enable communication and computation to be overlapped have the potential to improve the performance of parallel applications. With currently available high-performance networks there are several options for making progress: manual progression, use of a progress thread, and communication offload. In this paper we analyze threaded progression approaches, comparing the effects of using shared or dedicated CPU cores for progression. To perform these comparisons, we propose time-based and work-based benchmark schemes. As expected, threaded progression performs well when a spare core is available to be dedicated to communication progression, but a number of operating system effects prevent the same benefits from being obtained when communication progress must share a core with computation. We show that some limited performance improvement can be obtained in the shared-core case by real-time scheduling of the progress thread.
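The progress-thread option can be illustrated with a minimal sketch. This is a language-agnostic toy (real MPI progression lives inside the library in C), showing only the structure: one thread computes and posts messages while a separate thread continuously polls and completes them:

```python
# Illustrative sketch (not MPI code) of a progress thread: the main
# thread computes while a second thread polls a "network" queue and
# completes outstanding messages, overlapping communication with work.
import queue
import threading

inbox = queue.Queue()
completed = []
done = threading.Event()

def progress_thread():
    # On a dedicated core this loop can poll freely; sharing a core with
    # the computation is where the paper observes OS scheduling effects.
    while not done.is_set() or not inbox.empty():
        try:
            msg = inbox.get(timeout=0.01)
            completed.append(msg)          # "complete" the message
        except queue.Empty:
            pass

t = threading.Thread(target=progress_thread)
t.start()
for i in range(5):
    inbox.put(f"msg{i}")                   # computation posts messages...
    _ = sum(j * j for j in range(10000))   # ...and keeps computing
done.set()
t.join()
print(sorted(completed))  # all five messages progressed in the background
```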
DWC2: A dynamic weight-based cooperative caching scheme for object-based storage cluster
Q. Wei, B. Veeravalli, Lingfang Zeng
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663768
Object-based storage is emerging as the next generation of distributed storage technology. Aiming to improve the performance and load balancing of large-scale object-based storage systems, we present a dynamic weight-based cooperative caching scheme, referred to as DWC2, which allows an object-based storage device (OSD) to use the available free cache of neighbouring OSDs. DWC2 replaces objects based on their weights, where an object's weight is a function of its size, popularity, and replica count, and it dynamically partitions each OSD's memory into a local cache and a remote cache according to the current workload. Object data is cached either in the local cache or in the remote cache of cooperating OSDs, which increases the cache hit ratio, reduces expensive disk accesses, and improves load balance. We benchmarked DWC2 against existing cooperative caching schemes in various OSD environments. The experimental results demonstrate that DWC2 is scalable and achieves a higher cache hit ratio, lower average response time, and better load balancing for large-scale OSD clusters.
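The abstract names the inputs to the weight function (size, popularity, replica count) but not its form, so the formula below is a hypothetical stand-in chosen only to illustrate weight-based eviction: small, popular, poorly replicated objects score high and are kept; the lowest-weight object is evicted first:

```python
# Hypothetical weight function for weight-based replacement. The real
# DWC2 formula is not given in the abstract; this one merely favours
# popular objects and penalises size and redundancy, as the inputs suggest.

def weight(size, popularity, replicas):
    return popularity / (size * replicas)

def evict_one(cache):
    """cache: dict name -> (size, popularity, replicas). Evicts and
    returns the name of the lowest-weight object."""
    victim = min(cache, key=lambda k: weight(*cache[k]))
    del cache[victim]
    return victim

cache = {
    "hot_small":  (64,   100, 1),   # high weight: worth keeping
    "cold_large": (4096,   2, 3),   # low weight: evicted first
    "warm":       (512,   20, 2),
}
print(evict_one(cache))  # cold_large
```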
A trace-driven emulation framework to predict scalability of large clusters in presence of OS Jitter
Pradipta De, Ravina Kothari, V. Mann
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663776
Various studies have pointed out the debilitating effects of OS jitter on the performance of parallel applications on large clusters such as ASCI Purple and MareNostrum at the Barcelona Supercomputing Center, which run commodity OSes (AIX and Linux, respectively). The biggest hindrance to evaluating any jitter-mitigation technique is getting access to such large-scale production HPC systems running a commodity OS. An earlier attempt at solving this problem emulated the effects of OS jitter on more widely available, jitter-free systems such as BlueGene/L. In this paper, we point out the shortcomings of previous such approaches and present the design and implementation of an emulation framework that overcomes them. We collect jitter traces on a commodity OS with the configuration whose scaling behavior we want to study; these traces are then replayed on a jitter-free system to predict scalability in the presence of OS jitter. We illustrate the framework through a comparative scalability study of an off-the-shelf Linux distribution with a minimal configuration (runlevel 1) and a highly optimized embedded Linux distribution running on the I/O nodes of BlueGene/L. We validate the results of our emulation on both a single node and a real cluster. Our results indicate that an optimized OS, combined with a technique to synchronize jitter, can reduce the performance degradation due to jitter at 2048 processors from 99% (off-the-shelf Linux without synchronization) to a much more tolerable 6% (highly optimized BlueGene/L I/O node Linux with synchronization). Furthermore, perfect synchronization gives linear scaling with less than 1% slowdown, regardless of the OS used. However, as jitter at different nodes becomes desynchronized, even with a minor skew across nodes, the optimized OS starts outperforming the off-the-shelf OS.
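The effect of jitter synchronization on a bulk-synchronous application can be shown with a toy model (my construction, not the authors' framework): each barrier-synchronized step costs the maximum per-process delay, so when all processes hit the same OS daemon wakeup together the barrier absorbs it once, while desynchronized wakeups stall a different step each time:

```python
# Toy model of why synchronized jitter hurts less in a bulk-synchronous
# code: a barrier step costs the MAX jitter delay across processes, so
# overlapping delays are absorbed together, shifted ones are paid twice.

def step_cost(delays):
    """One barrier-synchronized step: the slowest process sets the pace."""
    return 1.0 + max(delays)             # 1.0 = useful compute per step

def run(traces):
    """traces: per-process lists of per-step jitter delays."""
    steps = len(traces[0])
    return sum(step_cost([tr[s] for tr in traces]) for s in range(steps))

jitter = [0.0, 0.5, 0.0, 0.0]            # one daemon wakeup in step 1
sync   = [jitter, jitter]                # both nodes hit jitter together
desync = [jitter, jitter[1:] + jitter[:1]]  # same trace, shifted one step

print(run(sync))    # 4.5: the two wakeups overlap behind one barrier
print(run(desync))  # 5.0: each wakeup stalls a different step
```

The gap grows with node count, which matches the paper's observation that even minor skew lets per-node jitter dominate at scale.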
High message rate, NIC-based atomics: Design and performance considerations
K. Underwood, M. Levenhagen, K. Hemmert, R. Brightwell
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663764
Remote atomic memory operations are critical for achieving high-performance synchronization in tightly-coupled systems. Previous approaches to implementing atomic memory operations on high-performance networks have explored providing the primitives necessary to achieve low latency and low host processor overhead. In this paper, we explore the implementation of atomic memory operations with a focus on achieving high message rate. We believe that high message rate is a key performance characteristic that will determine the viability of a high-performance network to support future multi-petascale systems, especially those that expect to employ a partitioned global address space (PGAS) programming model. As an example, many have proposed using network interface level atomic operations to enhance the performance of the HPCC RandomAccess benchmark. This paper explores several issues relevant to the design of an atomic unit on the network interface. We explore the implications of the size of the cache as well as the associativity. Given the growing ratio of bandwidth to latency of modern host interfaces, we explore some of the interactions that impact the concurrency needed to saturate the interface.
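The cache size/associativity question the paper studies can be explored with a generic set-associative LRU model. This is not the authors' design; it is a minimal simulator for counting how often atomic targets hit a small on-NIC cache instead of host memory:

```python
# Generic set-associative LRU cache model for exploring how cache size
# and associativity affect the hit rate of a stream of atomic targets.
from collections import OrderedDict

class SetAssocCache:
    def __init__(self, num_sets, ways):
        self.num_sets, self.ways = num_sets, ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Returns True on hit; inserts with LRU eviction on miss."""
        s = self.sets[addr % self.num_sets]
        if addr in s:
            s.move_to_end(addr)          # refresh LRU position
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)        # evict least-recently-used entry
        s[addr] = None
        return False

# RandomAccess-style stream: two hot addresses plus cold conflict traffic.
# All addresses here map to the same set, so associativity dominates.
stream = [0, 8, 0, 8, 16, 0, 8, 24, 0, 8]
cache = SetAssocCache(num_sets=4, ways=2)
hits = sum(cache.access(a) for a in stream)
print(f"{hits}/{len(stream)} hits")
```

Sweeping `num_sets` and `ways` over a trace of atomic targets is one way to reproduce the kind of size/associativity trade-off analysis the paper describes.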
Runtime DVFS control with instrumented Code in power-scalable cluster system
Hideaki Kimura, M. Sato, Takayuki Imada, Y. Hotta
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663795
Recently, several energy-reduction techniques using DVFS have been presented for PC clusters. This work proposes Code-instrumented Runtime DVFS control, in which the combination of frequency and voltage (called a gear) is managed at runtime by code instrumented into the program. The instrumentation is inserted at the boundaries of program regions that share the same characteristics. Code-instrumented Runtime DVFS control is better than Interrupt-based Runtime DVFS control, in which the gear is managed by a periodic interrupt, because it can exploit program information when controlling DVFS. Although Static DVFS control, which makes use of a power profile gathered before execution, gives greater energy reduction, the proposed Code-instrumented Runtime DVFS control is easier to use because it requires no prior profiling. The proposed DVFS control method was designed and implemented, using beta-adaptation as the runtime algorithm for choosing the appropriate gear. The results show that the proposed method improves performance and energy consumption compared with Interrupt-based Runtime DVFS control. Although Code-instrumented Runtime DVFS control can select lower voltages and frequencies than existing Runtime DVFS control under a given deadline, it was also found to increase the power consumption of the PC cluster because of the resulting increase in execution time.
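The gear-selection step can be sketched using the standard beta model behind beta-adaptation: beta estimates how CPU-bound a region is (only that fraction of its time stretches as frequency drops), and the runtime picks the slowest gear whose predicted slowdown stays within an allowed bound. The gear table and example values below are assumptions for illustration:

```python
# Simplified beta-model gear selection: pick the lowest frequency whose
# predicted slowdown T(f)/T(f_max) = beta*(f_max/f) + (1 - beta) stays
# within the allowed slack delta. Gear values are assumed, not the paper's.

GEARS_GHZ = [2.0, 1.8, 1.5, 1.2, 1.0]    # frequency gears, fastest first

def predicted_slowdown(beta, f, f_max):
    """Only the CPU-bound fraction beta of a region's time scales
    with frequency; the rest (memory/IO stalls) does not."""
    return beta * (f_max / f) + (1.0 - beta)

def choose_gear(beta, delta, gears=GEARS_GHZ):
    f_max = gears[0]
    best = f_max
    for f in gears:                       # try slower gears in order
        if predicted_slowdown(beta, f, f_max) <= 1.0 + delta:
            best = f                      # slowest gear still within budget
    return best

print(choose_gear(beta=1.0, delta=0.05))   # CPU-bound region: stays at 2.0
print(choose_gear(beta=0.2, delta=0.25))   # memory-bound, 25% slack: 1.0
```

This also shows the failure mode the abstract ends on: under a loose deadline the rule happily picks a deep gear, and the longer execution time can outweigh the per-second power savings.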
Prediction of behavior of MPI applications
Marc Casas, Rosa M. Badia, Jesús Labarta
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663777
The scalability and performance of applications are very important issues today. As high-performance architectures have become more complex, predicting the behavior of a given application running on them has become harder. In this paper, we propose a methodology that automatically and quickly predicts, from a very limited number of runs using very few processors, the scalability and performance of a given application across a wide range of supercomputers, taking into account details of each machine's architecture and network.
In search of sweet-spots in parallel performance monitoring
A. Nataraj, A. Malony, A. Morris, D. Arnold, B. Miller
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663757
Parallel performance monitoring extends parallel measurement systems with infrastructure and interfaces for online performance data access, communication, and analysis. At the same time, it raises concerns about the impact of monitoring overhead on application execution. An application's monitoring scheme, parameterized by the performance events to monitor, the access frequency, and the type of data analysis performed, defines a set of monitoring requirements. The monitoring infrastructure presents its own choices, particularly the amount and configuration of resources devoted explicitly to monitoring. The key to scalable, low-overhead parallel performance monitoring is to match the application's monitoring demands to the effective operating range of the monitoring system (or vice versa). A poor match can result in over-provisioning (wasted resources) or under-provisioning (lack of scalability, high overheads, and poor-quality performance data). We present a methodology and evaluation framework for determining the sweet-spots of performance monitoring using TAU and MRNet.
A large-grained parallel algorithm for nonlinear eigenvalue problems and its implementation using OmniRPC
Takeshi Amako, Yusaku Yamamoto, Shaoliang Zhang
2008 IEEE International Conference on Cluster Computing | Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663754
The nonlinear eigenvalue problem plays an important role in various fields such as nonlinear elasticity, electronic structure calculation, and theoretical fluid dynamics. We recently proposed a new algorithm for the nonlinear eigenvalue problem that reduces the original problem, via a complex contour integral, to a smaller generalized linear eigenvalue problem with Hankel coefficient matrices. A unique feature of this algorithm is that it can find all the eigenvalues within a closed curve in the complex plane. Moreover, it has large-grain parallelism and is well suited to execution in a grid environment. In this paper, we study the numerical properties of our algorithm theoretically. In particular, we analyze the effect of numerical integration error on the computed eigenvalues and give a guideline for choosing the size of the Hankel matrices. We also show the parallel performance of our algorithm implemented on a PC cluster using OmniRPC, a grid RPC system: a parallel efficiency of 75% is achieved when solving a nonlinear eigenvalue problem of order 1000 on 14 processors.
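The contour-integral-to-Hankel construction can be demonstrated on a scalar toy problem (my simplification: a rational function with known poles stands in for the matrix resolvent the paper uses). Moments are computed by the trapezoidal rule around a circle, packed into Hankel matrices H0 and H1, and the generalized eigenvalues of the pencil (H1, H0) recover the poles inside the contour:

```python
# Scalar toy model of the contour-integral algorithm: quadrature moments
# mu_k = (1/2*pi*i) * contour-integral of z^k * f(z) dz are assembled into
# Hankel matrices, and the pencil (H1, H0) yields the poles inside the
# circle. The large-grain parallelism comes from evaluating f(z) at the
# quadrature nodes independently (one RPC per node in the paper's setup).
import numpy as np

poles = [0.3, -0.5]                        # "eigenvalues" inside the unit circle
f = lambda z: sum(1.0 / (z - p) for p in poles)

N, m = 64, 2                               # quadrature points, Hankel size
z = np.exp(2j * np.pi * np.arange(N) / N)  # nodes on the unit circle
mu = [(z ** (k + 1) * f(z)).mean() for k in range(2 * m)]  # trapezoid rule

H0 = np.array([[mu[i + j] for j in range(m)] for i in range(m)])
H1 = np.array([[mu[i + j + 1] for j in range(m)] for i in range(m)])
eigs = np.linalg.eigvals(np.linalg.solve(H0, H1))  # pencil (H1, H0)

print(np.sort(eigs.real))                  # ≈ [-0.5, 0.3]
```

The quadrature error decays geometrically with N for poles well inside the contour, which is the kind of behaviour the paper's integration-error analysis quantifies when setting the Hankel size.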