Performance of Hybrid Message-Passing and Shared-Memory Parallelism for Discrete Element Modeling
D. Henty. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1109/SC.2000.10005

The current trend in HPC hardware is towards clusters of shared-memory (SMP) compute nodes. For application developers, the major question is how best to program these SMP clusters. To address this we study an algorithm from Discrete Element Modeling, parallelised using the message-passing and shared-memory models simultaneously ("hybrid" parallelisation). The natural load-balancing methods differ between the two parallel models, the shared-memory method being in principle more efficient for very load-imbalanced problems. It is therefore possible that hybrid parallelism will be beneficial on SMP clusters. We benchmark MPI and OpenMP implementations of the algorithm on MPP, SMP and cluster architectures, and evaluate the effectiveness of hybrid parallelism. Although we observe cases where OpenMP is more efficient than MPI on a single SMP node, we conclude that our current OpenMP implementation is not yet efficient enough for hybrid parallelism to outperform pure message-passing on an SMP cluster.
{"title":"Performance of Hybrid Message-Passing and Shared-Memory Parallelism for Discrete Element Modeling","authors":"D. Henty","doi":"10.1109/SC.2000.10005","DOIUrl":"https://doi.org/10.1109/SC.2000.10005","url":null,"abstract":"The current trend in HPC hardware is towards clusters of shared-memory (SMP) compute nodes. For applications developers the major question is how best to program these SMP clusters. To address this we study an algorithm from Discrete Element Modeling, parallelised using both the message-passing and shared-memory models simultaneously (\"hybrid\" parallelisation). The natural load-balancing methods are different in the two parallel models, the shared-memory method being in principle more efficient for very load-imbalanced problems. It is therefore possible that hybrid parallelism will be beneficial on SMP clusters. We benchmark MPI and OpenMP implementations of the algorithm on MPP, SMP and cluster architectures, and evaluate the effectiveness of hybrid parallelism. Although we observe cases where OpenMP is more efficient than MPI on a single SMP node, we conclude that our current OpenMP implementation is not yet efficient enough for hybrid parallelism to outperform pure message-passing on an SMP cluster.","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122492041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From Trace Generation to Visualization: A Performance Framework for Distributed Parallel Systems
C. Wu, Anthony Bolmarcich, M. Snir, David Wootton, F. Parpia, Anthony Chan, E. Lusk, W. Gropp. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1109/SC.2000.10050

In this paper we describe a trace analysis framework, from trace generation to visualization. It includes a unified tracing facility on IBM SP systems, a self-defining interval file format, an API for framework extensions, utilities for merging and statistics generation, and a visualization tool with preview and multiple time-space diagrams. The trace environment is highly scalable, and combines MPI events with system activities in the same set of trace files, one for each SMP node. Since the amount of trace data may be very large, utilities are developed to convert and merge individual trace files into a self-defining interval trace file with multiple frame directories. The interval format allows multiple time-space diagrams, such as thread-activity and processor-activity views, to be built from the same interval file. A visualization tool, Jumpshot, is modified to visualize these views, and a statistics utility with a graphics viewer is built on the API.
{"title":"From Trace Generation to Visualization: A Performance Framework for Distributed Parallel Systems","authors":"C. Wu, Anthony Bolmarcich, M. Snir, David Wootton, F. Parpia, Anthony Chan, E. Lusk, W. Gropp","doi":"10.1109/SC.2000.10050","DOIUrl":"https://doi.org/10.1109/SC.2000.10050","url":null,"abstract":"In this paper we describe a trace analysis framework, from trace generation to visualization. It includes a unified tracing facility on IBMâ SPä systems, a self-defining interval file format, an API for framework extensions, utilities for merging and statistics generation, and a visualization tool with preview and multiple time-space diagrams. The trace environment is extremely scalable, and combines MPI events with system activities in the same set of trace files, one for each SMP node. Since the amount of trace data may be very large, utilities are developed to convert and merge individual trace files into a self-defining interval trace file with multiple frame directories. The interval format allows the development of multiple time-space diagrams, such as thread-activity view, processor-activity view, etc., from the same interval file. A visualization tool, Jumpshot, is modified to visualize these views. A statistics utility is developed using the API, along with its graphics viewer.","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132932234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
92¢/MFlops/s, Ultra-Large-Scale Neural-Network Training on a PIII Cluster
Douglas Aberdeen, Jonathan Baxter, R. Edwards. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1109/SC.2000.10031

Artificial neural networks with millions of adjustable parameters and a similar number of training examples are a potential solution for difficult, large-scale pattern recognition problems in areas such as speech and face recognition, classification of large volumes of web data, and finance. The bottleneck is that neural network training involves iterative gradient descent and is extremely computationally intensive. In this paper we present a technique for distributed training of Ultra Large Scale Neural Networks (ULSNN) on Bunyip, a Linux-based cluster of 196 Pentium III processors. To illustrate ULSNN training we describe an experiment in which a neural network with 1.73 million adjustable parameters was trained to recognize machine-printed Japanese characters from a database containing 9 million training patterns. The training runs at an average performance of 163.3 GFlops/s (single precision). With a machine cost of $150,913, this yields a price/performance ratio of 92.4¢/MFlops/s (single precision). For comparison, training using double precision and the ATLAS DGEMM produces a sustained performance of 70 GFlops/s, or $2.16/MFlops/s (double precision).
{"title":"92¢ /MFlops/s, Ultra-Large-Scale Neural-Network Training on a PIII Cluster","authors":"Douglas Aberdeen, Jonathan Baxter, R. Edwards","doi":"10.1109/SC.2000.10031","DOIUrl":"https://doi.org/10.1109/SC.2000.10031","url":null,"abstract":"Artificial neural networks with millions of adjustable parameters and a similar number of training examples are a potential solution for difficult, large-scale pattern recognition problems in areas such as speech and face recognition, classification of large volumes of web data, and finance. The bottleneck is that neural network training involves iterative gradient descent and is extremely computationally intensive. In this paper we present a technique for distributed training of Ultra Large Scale Neural Networks 1 (ULSNN) on Bunyip, a Linux-based cluster of 196 Pentium III processors. To illustrate ULSNN training we describe an experiment in which a neural network with 1.73 million adjustable parameters was trained to recognize machine-printed Japanese characters from a database containing 9 million training patterns. The training runs with a average performance of 163.3 GFlops/s (single precision). With a machine cost of $150,913, this yields a price/performance ratio of 92.4¢ /MFlops/s (single precision). For comparison purposes, training using double precision and the ATLAS DGEMM produces a sustained performance of 70 MFlops/s or $2.16 / MFlop/s (double precision).","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133582966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Requirements for and Evaluation of RMI Protocols for Scientific Computing
M. Govindaraju, Aleksander Slominski, Venkatesh Choppella, R. Bramley, Dennis Gannon. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1109/SC.2000.10060

Distributed software component architectures provide a promising approach to the problem of building large-scale scientific Grid applications [18]. Communication in these component architectures is based on Remote Method Invocation (RMI) protocols that allow one software component to invoke the functionality of another. Examples include Java remote method invocation (Java RMI) [25] and the new Simple Object Access Protocol (SOAP) [15]. SOAP has the advantage that many programming languages and component frameworks can support it. This paper describes experiments showing that SOAP by itself is not efficient enough for large-scale scientific applications. However, when it is embedded in a multi-protocol RMI framework, SOAP can be effectively used as a universal control protocol that can be swapped out for faster, more specialized protocols when high data-transfer speeds are needed.
{"title":"Requirements for and Evaluation of RMI Protocols for Scientific Computing","authors":"M. Govindaraju, Aleksander Slominski, Venkatesh Choppella, R. Bramley, Dennis Gannon","doi":"10.1109/SC.2000.10060","DOIUrl":"https://doi.org/10.1109/SC.2000.10060","url":null,"abstract":"Distributed software component architectures provide promising approach to the problem of building large scale, scientific Grid applications [18]. Communication in these component architectures is based on Remote Method Invocation (RMI) protocols that allow one software component to invoke the functionality of another. Examples include Java remote method invocation (Java RMI)[25] and the new Simple Object Access Protocol (SOAP) [15]. SOAP has the advantage that many programming languages and component frameworks can support it. This paper describes experiments showing that SOAP by itself is not efficient enough for large scale scientific applications. However, when it is embedded in multi-protocol RMI framework, SOAP can be effectively used as a universal control protocol, that can be swapped out by faster, more special purpose protocols when large data transfer speeds are needed.","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134301849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Comparative Study of the NAS MG Benchmark across Parallel Languages and Architectures
B. Chamberlain, Steven J. Deitz, L. Snyder. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1109/SC.2000.10006

Hierarchical algorithms such as multigrid applications form an important cornerstone for scientific computing. In this study, we take a first step toward evaluating parallel language support for hierarchical applications by comparing implementations of the NAS MG benchmark in several parallel programming languages: Co-Array Fortran, High Performance Fortran, Single Assignment C, and ZPL. We evaluate each language in terms of its portability, its performance, and its ability to express the algorithm clearly and concisely. Experimental platforms include the Cray T3E, IBM SP, SGI Origin, Sun Enterprise 5500, and a high-performance Linux cluster. Our findings indicate that while it is possible to achieve good portability, performance, and expressiveness, most languages currently fall short in at least one of these areas. We find a strong correlation between expressiveness and a language's support for a global view of computation, and we identify key factors for achieving portable performance in multigrid applications.
{"title":"A Comparative Study of the NAS MG Benchmark across Parallel Languages and Architectures","authors":"B. Chamberlain, Steven J. Deitz, L. Snyder","doi":"10.1109/SC.2000.10006","DOIUrl":"https://doi.org/10.1109/SC.2000.10006","url":null,"abstract":"Hierarchical algorithms such as multigrid applications form an important cornerstone for scientific computing. In this study, we take a first step toward evaluating parallel language support for hierarchical applications by comparing implementations of the NAS MG benchmark in several parallel programming languages: Co-Array Fortran, High Performance Fortran, Single Assignment C, and ZPL. We evaluate each language in terms of its portability, its performance, and its ability to express the algorithm clearly and concisely. Experimental platforms include the Cray T3E, IBM SP, SGI Origin, Sun Enterprise 5500, and a high-performance Linux cluster. Our findings indicate that while it is possible to achieve good portability, performance, and expressiveness, most languages currently fall short in at least one of these areas. We find a strong correlation between expressiveness and a language’s support for a global view of computation, and we identify key factors for achieving portable performance in multigrid applications.","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116989747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid
H. Casanova, Graziano Obertelli, F. Berman, R. Wolski. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1155/2000/319291

The Computational Grid is a promising platform for the efficient execution of parameter sweep applications over large parameter spaces. To achieve performance on the Grid, such applications must be scheduled so that shared data files are strategically placed to maximize reuse, and so that the application execution can adapt to the deliverable performance potential of target heterogeneous, distributed and shared resources. Parameter sweep applications are an important class of applications and would greatly benefit from the development of Grid middleware that embeds a scheduler for performance and targets Grid resources transparently. In this paper we describe a user-level Grid middleware project, the AppLeS Parameter Sweep Template (APST), that uses application-level scheduling techniques [1] and various Grid technologies to allow the efficient deployment of parameter sweep applications over the Grid. We discuss several possible scheduling algorithms and detail our software design. We then describe our current implementation of APST using systems like Globus [2], NetSolve [3] and the Network Weather Service [4], and present experimental results.
{"title":"The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid","authors":"H. Casanova, Graziano Obertelli, F. Berman, R. Wolski","doi":"10.1155/2000/319291","DOIUrl":"https://doi.org/10.1155/2000/319291","url":null,"abstract":"The Computational Grid is a promising platform for the efficient execution of parameter sweep applications over large parameter spaces. To achieve performance on the Grid, such applications must be scheduled so that shared data files are strategically placed to maximize reuse, and so that the application execution can adapt to the deliverable performance potential of target heterogeneous, distributed and shared resources. Parameter sweep applications are an important class of applications and would greatly benefit from the development of Grid middleware that embeds a scheduler for performance and targets Grid resources transparently. In this paper we describe a user-level Grid middleware project, the AppLeS Parameter Sweep Template (APST), that uses application-level scheduling techniques [1] and various Grid technologies to allow the efficient deployment of parameter sweep applications over the Grid. We discuss several possible scheduling algorithms and detail our software design. We then describe our current implementation of APST using systems like Globus [2], NetSolve [3] and the Network Weather Service [4], and present experimental results.","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121472117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Parallel Dynamic-Mesh Lagrangian Method for Simulation of Flows with Dynamic Interfaces
J. Antaki, G. Blelloch, O. Ghattas, Ivan Malcevic, G. Miller, N. Walkington. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1109/SC.2000.10045

Many important phenomena in science and engineering, including our motivating problem of microstructural blood flow, can be modeled as flows with dynamic interfaces. The major challenge faced in simulating such flows is resolving the interfacial motion. Lagrangian methods are ideally suited for such problems, since interfaces are naturally represented and propagated. However, the material description of motion results in dynamic meshes, which become hopelessly distorted unless they are regularly regenerated. Lagrangian methods are particularly challenging on parallel computers, because scalable dynamic mesh methods remain elusive. Here, we present a parallel dynamic mesh Lagrangian method for flows with dynamic interfaces. We take an aggressive approach to dynamic meshing by triangulating the propagating grid points at every timestep using a scalable parallel Delaunay algorithm. Contrary to conventional wisdom, we show that the costs of the geometric components (triangulation, coarsening, refinement, and partitioning) can be made small relative to the flow solver.
{"title":"A PARALLEL DYNAMIC-MESH LAGRANGIAN METHOD FOR SIMULATION OF FLOWS WITH DYNAMIC INTERFACES","authors":"J. Antaki, G. Blelloch, O. Ghattas, Ivan Malcevic, G. Miller, N. Walkington","doi":"10.1109/SC.2000.10045","DOIUrl":"https://doi.org/10.1109/SC.2000.10045","url":null,"abstract":"Many important phenomena in science and engineering, including our motivating problem of microstructural blood flow, can be modeled as flows with dynamic interfaces. The major challenge faced in simulating such flows is resolving the interfacial motion. Lagrangian methods are ideally suited for such problems, since interfaces are naturally represented and propagated. However, the material description of motion results in dynamic meshes, which become hopelessly distorted unless they are regularly regenerated. Lagrangian methods are particularly challenging on parallel computers, because scalable dynamic mesh methods remain elusive. Here, we present a parallel dynamic mesh Lagrangian method for flows with dynamic interfaces. We take an aggressive approach to dynamic meshing by triangulating the propagating grid points at every timestep using a scalable parallel Delaunay algorithm. Contrary to conventional wisdom, we show that the costs of the geometric components (triangulation, coarsening, refinement, and partitioning) can be made small relative to the flow solver.","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117150137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Framework for Sparse Matrix Code Synthesis from High-level Specifications
Nawaaz Ahmed, N. Mateev, K. Pingali, Paul V. Stodghill. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1109/SC.2000.10033

We present compiler technology for synthesizing sparse matrix code from (i) dense matrix code and (ii) a description of the index structure of a sparse matrix. Our approach is to embed statement instances into a Cartesian product of statement iteration and data spaces, and to produce efficient sparse code by identifying common enumerations for multiple references to sparse matrices. The approach works for imperfectly-nested codes with dependences, and produces sparse code competitive with hand-written library code for the Basic Linear Algebra Subprograms (BLAS).
{"title":"A Framework for Sparse Matrix Code Synthesis from High-level Specifications","authors":"Nawaaz Ahmed, N. Mateev, K. Pingali, Paul V. Stodghill","doi":"10.1109/SC.2000.10033","DOIUrl":"https://doi.org/10.1109/SC.2000.10033","url":null,"abstract":"We present compiler technology for synthesizing sparse matrix code from (i) dense matrix code, and (ii) a description of the index structure of a sparse matrix. Our approach is to embed statement instances into a Cartesian product of statement iteration and data spaces, and to produce efficient sparse code by identifying common enumerations for multiple references to sparse matrices. The approach works for imperfectly-nested codes with dependences, and produces sparse code competitive with hand-written library code for the Basic Linear Algebra Subroutines (BLAS).","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129658402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PSockets: The Case for Application-level Network Striping for Data Intensive Applications using High Speed Wide Area Networks
H. Sivakumar, S. Bailey, R. Grossman. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1109/SC.2000.10040

Transmission Control Protocol (TCP) is used by various applications to achieve reliable data transfer, and was originally designed for unreliable networks. With the emergence of high-speed wide area networks, various improvements have been applied to TCP to reduce latency and improve bandwidth. These improvements, however, require system administrators to tune the network, which can take a considerable amount of time. This paper introduces PSockets (Parallel Sockets), a library that achieves equivalent performance without manual tuning. The basic idea behind PSockets is to exploit network striping: striping partitioned data across several open sockets. We describe experimental studies using PSockets over the Abilene network and show in particular that network striping is effective for high-performance, data-intensive computing applications that use geographically distributed data.
{"title":"PSockets: The Case for Application-level Network Striping for Data Intensive Applications using High Speed Wide Area Networks","authors":"H. Sivakumar, S. Bailey, R. Grossman","doi":"10.1109/SC.2000.10040","DOIUrl":"https://doi.org/10.1109/SC.2000.10040","url":null,"abstract":"Transmission Control Protocol (TCP) is used by various applications to achieve reliable data transfer. TCP was originally designed for unreliable networks. With the emergence of high-speed wide area networks various improvements have been applied to TCP to reduce latency and achieve improved bandwidth. The improvement is achieved by having system administrators tune the network and can take a considerable amount of time. This paper introduces PSockets (Parallel Sockets), a library that achieves an equivalent performance without manual tuning. The basic idea behind PSockets is to exploit network striping. By network striping we mean striping partitioned data across several open sockets. We describe experimental studies using PSockets over the Abilene network. We show in particular that network striping using PSockets is effective for high performance data intensive computing applications using geographically distributed data.","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117212236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-Cost CFD on a Low-Cost Cluster
T. Hauser, T. Mattox, R. LeBeau, H. Dietz, P. Huang. ACM/IEEE SC 2000 Conference (SC'00). DOI: 10.1109/SC.2000.10020

Direct numerical simulation (DNS) of the Navier-Stokes equations is an important technique for the future of computational fluid dynamics (CFD) in engineering applications, but it requires massive computing resources. This paper presents a new approach to implementing high-cost DNS CFD using low-cost cluster hardware. After describing the DNS CFD code DNSTool, the paper focuses on the techniques and tools that we have developed to customize the performance of a cluster implementation of this application. This tuning of system performance involves both recoding of the application and careful engineering of the cluster design. On the cluster KLAT2 (Kentucky Linux Athlon Testbed 2), DNSTool cannot match the $0.64 per MFLOPS that KLAT2 achieves on single-precision ScaLAPACK, but it is still very efficient, achieving price/performance of $2.75 per MFLOPS in double precision and $1.86 in single precision. Further, the code and tools are all, or will soon be, made freely available as full source code.
{"title":"High-Cost CFD on a Low-Cost Cluster","authors":"T. Hauser, T. Mattox, R. LeBeau, H. Dietz, P. Huang","doi":"10.1109/SC.2000.10020","DOIUrl":"https://doi.org/10.1109/SC.2000.10020","url":null,"abstract":"Direct numerical simulation of the Navier-Stokes equations (DNS) is an important technique for the future of computational fluid dynamics (CFD) in engineering applications. However, DNS requires massive computing resources. This paper presents a new approach for implementing high-cost DNS CFD using low-cost cluster hardware. After describing the DNS CFD code DNSTool, the paper focuses on the techniques and tools that we have developed to customize the performance of a cluster implementation of this application. This tuning of system performance involves both recoding of the application and careful engineering of the cluster design. Using the cluster KLAT2 (Kentucky Linux Athlon Testbed 2), while DNSTool cannot match the $0.64 per MFLOPS that KLAT2 achieves on single precision ScaLAPACK, it is very efficient; DNSTool on KLAT2 achieves price/performance of $2.75 per MFLOPS double precision and $1.86 single precision. Further, the code and tools are all, or will soon be, made freely available as full source code.","PeriodicalId":228250,"journal":{"name":"ACM/IEEE SC 2000 Conference (SC'00)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115753253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}