A decade ago, high-performance computing (HPC) was about to "come of age" and we were convinced it would have significant impact throughout the computing industry. Instead, the HPC community has remained small and elitist. The rate at which technical applications have been ported to parallel and distributed platforms is distressingly slow, given that the availability of key applications is precisely the mechanism needed to drive the growth of the community. When major software vendors state publicly that their products will never be parallelized - as some have in recent months - it's time for us to take a hard look at reality. Marketing and PR claims to the contrary, HPC is not a success story. Although our capabilities continue to expand, we have not found a way to make HPC improve our productivity.
{"title":"The Emperor Has No Clothes: What HPC Users Need to Say and HPC Vendors Need to Hear","authors":"C. Pancake","doi":"10.1145/224170.224172","DOIUrl":"https://doi.org/10.1145/224170.224172","url":null,"abstract":"A decade ago, high-performance computing (HPC) was about to \"come of age\" and we were convinced it would have significant impact throughout the computing industry. Instead, the HPC community has remained small and elitist. The rate at which technical applications have been ported to parallel and distributed platforms is distressingly slow, given that the availability of key applications is precisely the mechanism needed to drive the growth of the community. When major software vendors state publicly that their products will never be parallelized - as some have in recent months - it's time for us to take a hard look at reality. Marketing and PR claims to the contrary, HPC is not a success story. Although our capabilities continue to expand, we have not found a way to make HPC improve our productivity.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133097937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent research has offered programmers increased options for programming parallel computers by exposing system policies (e.g., memory coherence protocols) or by providing several programming paradigms (e.g. message passing and shared memory) on the same platform. Increased flexibility can lead to higher performance, but it is also a double-edged sword that demands a programmer understand his or her application and system at a more fundamental level. Our system, Tempest, allows a programmer to select or implement communication and memory coherence policies that fit an application's communication patterns. With it, we have achieved substantial performance gains without making major changes in programs. However, the process of selecting, designing, and implementing coherence protocols is difficult and time consuming, without tools to supply detailed information about an application's behavior and interaction with the memory system. StormWatch is a new visualization tool that aids a programmer through four mechanisms: tightly-coupled bidirectionally linked views, interactive filters, animation, and performance slicing. Multiple views present several aspects of program behavior simultaneously and show the same phenomenon from different perspectives. Real-time linking between views enables a programmer to explore levels of abstraction by changing a view and observing the effect on other views. Interactive filters, along with bidirectional linking, can isolate the effects of statements, loops, procedures, or files. StormWatch can also animate a program's dynamic behavior to show the evolution of program execution and communication. Finally, performance slicing captures causality among events. The examples in the paper illustrate how StormWatch helped us substantially improve the performance of two applications.
{"title":"Storm Watch: A Tool for Visualizing Memory System Protocols","authors":"Trishul M. Chilimbi, T. Ball, S. Eick, J. Larus","doi":"10.1145/224170.224287","DOIUrl":"https://doi.org/10.1145/224170.224287","url":null,"abstract":"Recent research has offered programmers increased options for programming parallel computers by exposing system policies (e.g., memory coherence protocols) or by providing several programming paradigms (e.g. message passing and shared memory) on the same platform. Increased flexibility can lead to higher performance, but it is also a double-edged sword that demands a programmer understand his or her application and system at a more fundamental level. Our system, Tempest, allows a programmer to select or implement communication and memory coherence policies that fit an application's communication patterns. With it, we have achieved substantial performance gains without making major changes in programs. However, the process of selecting, designing, and implementing coherence protocols is difficult and time consuming, without tools to supply detailed information about an application's behavior and interaction with the memory system. StormWatch is a new visualization tool that aids a programmer through four mechanisms: tightly-coupled bidirectionally linked views, interactive filters, animation, and performance slicing. Multiple views present several aspects of program behavior simultaneously and show the same phenomenon from different perspectives. Real-time linking between views enables a programmer to explore levels of abstraction by changing a view and observing the effect on other views. Interactive filters, along with bidirectional linking, can isolate the effects of statements, loops, procedures, or files. StormWatch can also animate a program's dynamic behavior to show the evolution of program execution and communication. Finally, performance slicing captures causality among events. The examples in the paper illustrate how StormWatch helped us substantially improve the performance of two applications.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"255 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133231841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes a HIPPI-SONET Gateway which has been designed by members of the Computer Network Engineering Group at Los Alamos National Laboratory. The Gateway has been used in the CASA Gigabit Testbed at Caltech, Los Alamos National Laboratory, and the San Diego Supercomputer Center to provide communications between the sites. This paper will also make some qualitative statements as to lessons learned during the deployment and maintenance of this wide area network. We report record throughput for transmission of data across a wide area network. We have sustained data rates using the TCP/IP protocol of 550 Mbits/second and the rate of 792 Mbits/second for raw HIPPI data transfer over the 2,000 kilometers from the San Diego Supercomputer Center to the Los Alamos National Laboratory.
{"title":"Wide-Area Gigabit Networking: Los Alamos HIPPI-SONET Gateway","authors":"W. S. John, D. DuBois","doi":"10.1145/224170.224313","DOIUrl":"https://doi.org/10.1145/224170.224313","url":null,"abstract":"This paper describes a HIPPI-SONET Gateway which has been designed by members of the Computer Network Engineering Group at Los Alamos National Laboratory. The Gateway has been used in the CASA Gigabit Testbed at Caltech, Los Alamos National Laboratory, and the San Diego Supercomputer Center to provide communications between the sites. This paper will also make some qualitative statements as to lessons learned during the deployment and maintenance of this wide area network. We report record throughput for transmission of data across a wide area network. We have sustained data rates using the TCP/IP protocol of 550 Mbits/second and the rate of 792 Mbits/second for raw HIPPI data transfer over the 2,000 kilometers from the San Diego Supercomputer Center to the Los Alamos National Laboratory.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122910470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
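To put the reported rates in perspective, the sketch below estimates the propagation delay and the bandwidth-delay product for the 2,000 km path. The signal speed in fiber (about 2e8 m/s) and a cable route exactly as long as the quoted distance are assumptions made for illustration, not figures from the paper; gateway and switching delays are ignored.

```python
# Rough bandwidth-delay estimate for the San Diego to Los Alamos path.
# ASSUMPTIONS (not from the paper): signal speed in fiber of ~2e8 m/s,
# and a cable route exactly as long as the quoted 2,000 km.

DISTANCE_M = 2_000 * 1_000          # path length quoted in the abstract
FIBER_SPEED_M_PER_S = 2.0e8         # assumed propagation speed in fiber
TCP_RATE_BPS = 550e6                # sustained TCP/IP rate from the abstract
RAW_RATE_BPS = 792e6                # raw HIPPI rate from the abstract


def round_trip_time(distance_m: float, speed_m_per_s: float) -> float:
    """Propagation-only round-trip time in seconds."""
    return 2.0 * distance_m / speed_m_per_s


def bandwidth_delay_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a link of this rate busy."""
    return rate_bps * rtt_s / 8.0


rtt = round_trip_time(DISTANCE_M, FIBER_SPEED_M_PER_S)
window = bandwidth_delay_bytes(TCP_RATE_BPS, rtt)
tcp_fraction = TCP_RATE_BPS / RAW_RATE_BPS

print(f"round-trip time: {rtt * 1e3:.1f} ms")
print(f"data in flight at the TCP rate: {window / 1e6:.2f} MB")
print(f"TCP rate as a fraction of raw HIPPI: {tcp_fraction:.2f}")
```

Under these assumptions, more than a megabyte must be in flight to sustain the TCP rate, which is far beyond the 64 KB window that TCP allows without window scaling.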
This paper discusses the comprehensive performance profiling, improvement and benchmarking of a Computational Fluid Dynamics code, one of the Grand Challenge applications, on three popular multiprocessors. In the process of analyzing performance we considered language, compiler, architecture, and algorithmic changes and quantified each of them and their incremental contribution to bottom-line performance. We demonstrate that parallelization alone cannot result in significant gains if the granularity of parallel threads and the effect of parallelization on data locality are not taken into account. Unlike benchmarking studies that often focus on the performance or effectiveness of parallelizing compilers on specific loop kernels, we used the entire CFD code to measure the global effectiveness of compilers and parallel architectures. We probed the performance bottlenecks in each case and derived solutions which eliminate or neutralize the performance inhibiting factors. The major conclusion of our work is that overall performance is extremely sensitive to the synergetic effects of compiler optimizations, algorithmic and code tuning, and architectural idiosyncrasies.
{"title":"The Synergetic Effect of Compiler, Architecture, and Manual Optimizations on the Performance of CFD on Multiprocessors","authors":"M. Kuba, C. Polychronopoulos, K. Gallivan","doi":"10.1145/224170.224426","DOIUrl":"https://doi.org/10.1145/224170.224426","url":null,"abstract":"This paper discusses the comprehensive performance profiling, improvement and benchmarking of a Computational Fluid Dynamics code, one of the Grand Challenge applications, on three popular multiprocessors. In the process of analyzing performance we considered language, compiler, architecture, and algorithmic changes and quantified each of them and their incremental contribution to bottom-line performance. We demonstrate that parallelization alone cannot result in significant gains if the granularity of parallel threads and the effect of parallelization on data locality are not taken into account. Unlike benchmarking studies that often focus on the performance or effectiveness of parallelizing compilers on specific loop kernels, we used the entire CFD code to measure the global effectiveness of compilers and parallel architectures. We probed the performance bottlenecks in each case and derived solutions which eliminate or neutralize the performance inhibiting factors. The major conclusion of our work is that overall performance is extremely sensitive to the synergetic effects of compiler optimizations, algorithmic and code tuning, and architectural idiosyncrasies.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114780606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Boundary Element Method is a widely-used discretization technique for solving boundary-value problems in engineering analysis. The solution of large problems by this method is limited by the storage and computational requirements for the generation and solution of large matrix systems resulting from the discretization. We discuss the implementation of these computations on the IBM SP-2 distributed-memory parallel computer, for applications involving the 3D Laplace and Helmholtz equations.
{"title":"A Case Study in Parallel Scientific Computing: The Boundary Element Method on a Distributed-Memory Multicomputer","authors":"R. Natarajan, D. Krishnaswamy","doi":"10.1145/224170.224277","DOIUrl":"https://doi.org/10.1145/224170.224277","url":null,"abstract":"The Boundary Element Method is a widely-used discretization technique for solving boundary-value problems in engineering analysis. The solution of large problems by this method is limited by the storage and computational requirements for the generation and solution of large matrix systems resulting from the discretization. We discuss the implementation of these computations on the IBM SP-2 distributed-memory parallel computer, for applications involving the 3D Laplace and Helmholtz equations.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122129826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, a number of researchers have investigated a class of algorithms that are based on multilevel graph partitioning that have moderate computational complexity, and provide excellent graph partitions. However, there exists little theoretical analysis that could explain the ability of multilevel algorithms to produce good partitions. In this paper we present such an analysis. We show under certain reasonable assumptions that even if no refinement is used in the uncoarsening phase, a good bisection of the coarser graph is worse than a good bisection of the finer graph by at most a small factor. We also show that for planar graphs, the size of a good vertex-separator of the coarse graph projected to the finer graph (without performing refinement in the uncoarsening phase) is higher than the size of a good vertex-separator of the finer graph by at most a small factor.
{"title":"Analysis of Multilevel Graph Partitioning","authors":"G. Karypis, Vipin Kumar","doi":"10.1145/224170.224229","DOIUrl":"https://doi.org/10.1145/224170.224229","url":null,"abstract":"Recently, a number of researchers have investigated a class of algorithms that are based on multilevel graph partitioning that have moderate computational complexity, and provide excellent graph partitions. However, there exists little theoretical analysis that could explain the ability of multilevel algorithms to produce good partitions. In this paper we present such an analysis. We show under certain reasonable assumptions that even if no refinement is used in the uncoarsening phase, a good bisection of the coarser graph is worse than a good bisection of the finer graph by at most a small factor. We also show that for planar graphs, the size of a good vertex-separator of the coarse graph projected to the finer graph (without performing refinement in the uncoarsening phase) is higher than the size of a good vertex-separator of the finer graph by at most a small factor.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115164635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
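The coarsen, bisect, and project-back cycle that the analysis above refers to can be sketched as follows. This is a minimal illustration, assuming a random maximal matching for coarsening and an arbitrary coarse bisection; the abstract names neither, and all function names are hypothetical. It demonstrates only one mechanical fact that such analyses build on: projecting a coarse bisection to the finer graph without refinement leaves the edge-cut weight unchanged.

```python
import random
from collections import defaultdict


def grid_graph(rows, cols):
    """Unit-weight 2D grid graph stored as {vertex: {neighbour: weight}}."""
    graph = {(r, c): {} for r in range(rows) for c in range(cols)}
    for r, c in graph:
        for dr, dc in ((0, 1), (1, 0)):
            nb = (r + dr, c + dc)
            if nb in graph:
                graph[(r, c)][nb] = 1
                graph[nb][(r, c)] = 1
    return graph


def coarsen(graph, rng):
    """Collapse a random maximal matching; return (coarse_graph, vertex_map)."""
    vertices = list(graph)
    rng.shuffle(vertices)
    vertex_map = {}
    next_id = 0
    for v in vertices:
        if v in vertex_map:
            continue
        vertex_map[v] = next_id
        # Match v with any still-unmatched neighbour, if one exists.
        for u in graph[v]:
            if u not in vertex_map:
                vertex_map[u] = next_id
                break
        next_id += 1
    coarse = defaultdict(lambda: defaultdict(int))
    for v, nbrs in graph.items():
        coarse[vertex_map[v]]  # make sure isolated coarse vertices exist
        for u, w in nbrs.items():
            cv, cu = vertex_map[v], vertex_map[u]
            if cv != cu:
                coarse[cv][cu] += w  # parallel edges merge, weights add
    return {v: dict(n) for v, n in coarse.items()}, vertex_map


def edge_cut(graph, side):
    """Total weight of edges whose endpoints lie on different sides."""
    return sum(w for v, nbrs in graph.items()
               for u, w in nbrs.items() if side[v] != side[u]) // 2


def project(coarse_side, vertex_map):
    """Carry a coarse bisection back to the finer graph, with no refinement."""
    return {v: coarse_side[cv] for v, cv in vertex_map.items()}


rng = random.Random(0)
fine = grid_graph(4, 4)
coarse, vertex_map = coarsen(fine, rng)
# Any coarse bisection will do for the demonstration; a real partitioner
# would compute a good one here.
coarse_side = {cv: int(cv >= len(coarse) // 2) for cv in coarse}
fine_side = project(coarse_side, vertex_map)
print(edge_cut(coarse, coarse_side), edge_cut(fine, fine_side))
```

The two printed cut weights are equal because every fine edge either lies inside a collapsed pair, and so is never cut, or contributes its weight to exactly one coarse edge.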
Hassan Fallah-Adl, J. JáJá, S. Liang, Y. Kaufman, J. Townshend
Remotely sensed imagery has been used for developing and validating various studies regarding land cover dynamics. However, the large amounts of imagery collected by the satellites are largely contaminated by the effects of atmospheric particles. The objective of atmospheric correction is to retrieve the surface reflectance from remotely sensed imagery by removing the atmospheric effects. We introduce a number of computational techniques that lead to a substantial speedup of an atmospheric correction algorithm based on using look-up tables. Excluding I/O time, the previous known implementation processes one pixel at a time and requires about 2.63 seconds per pixel on a SPARC-10 machine, while our implementation is based on processing the whole image and takes about 4-20 microseconds per pixel on the same machine. We also develop a parallel version of our algorithm that is scalable in terms of both computation and I/O. Experimental results obtained show that a Thematic Mapper (TM) image (36 MB per band, 5 bands need to be corrected) can be handled in less than 4.3 minutes on a 32-node CM-5 machine, including I/O time.
{"title":"Efficient Algorithms for Atmospheric Correction of Remotely Sensed Data","authors":"Hassan Fallah-Adl, J. JáJá, S. Liang, Y. Kaufman, J. Townshend","doi":"10.1145/224170.224194","DOIUrl":"https://doi.org/10.1145/224170.224194","url":null,"abstract":"Remotely sensed imagery has been used for developing and validating various studies regarding land cover dynamics. However, the large amounts of imagery collected by the satellites are largely contaminated by the effects of atmospheric particles. The objective of atmospheric correction is to retrieve the surface reflectance from remotely sensed imagery by removing the atmospheric effects. We introduce a number of computational techniques that lead to a substantial speedup of an atmospheric correction algorithm based on using look-up tables. Excluding I/O time, the previous known implementation processes one pixel at a time and requires about 2.63 seconds per pixel on a SPARC-10 machine, while our implementation is based on processing the whole image and takes about 4-20 microseconds per pixel on the same machine. We also develop a parallel version of our algorithm that is scalable in terms of both computation and I/O. Experimental results obtained show that a Thematic Mapper (TM) image (36 MB per band, 5 bands need to be corrected) can be handled in less than 4.3 minutes on a 32-node CM-5 machine, including I/O time.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115407894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
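The general idea of replacing per-pixel computation with a look-up table can be sketched as below. The correction function is a hypothetical linear stand-in, not the paper's atmospheric model, and 8-bit pixel values are assumed; only the per-pixel times are taken from the abstract.

```python
def toy_correction(digital_number: int) -> float:
    """HYPOTHETICAL stand-in for an expensive per-value correction."""
    path_radiance = 12.0      # assumed additive atmospheric term
    transmittance = 0.85      # assumed multiplicative atmospheric term
    return max(0.0, (digital_number - path_radiance) / transmittance)


def build_lut(levels: int = 256) -> list[float]:
    """Evaluate the correction once for every possible 8-bit value."""
    return [toy_correction(dn) for dn in range(levels)]


def correct_image(image: list[list[int]], lut: list[float]) -> list[list[float]]:
    """Correct a whole image by table look-up instead of recomputation."""
    return [[lut[dn] for dn in row] for row in image]


image = [[0, 12, 97], [255, 40, 12]]
corrected = correct_image(image, build_lut())

# Speedup implied by the per-pixel times quoted in the abstract.
OLD_S_PER_PIXEL = 2.63
NEW_S_PER_PIXEL = (4e-6, 20e-6)
speedups = [OLD_S_PER_PIXEL / t for t in NEW_S_PER_PIXEL]
print(corrected[0][:2], [round(s) for s in speedups])
```

With a table, the expensive function runs once per distinct value rather than once per pixel. The quoted times imply a speedup of roughly 130,000 to 660,000 times, excluding I/O.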
We feel strongly that a contemporary introductory course in machine organization and assembly language should focus on the essentials of how computers execute programs, and not be distracted by the complications of the extraordinarily sophisticated microprocessors that are available today. These essentials should form a strong base of knowledge from which students can draw as they continue their education in computer science. Ideally these goals should be attained in an environment that fosters experimentation and cooperation, and with the aid of projects that generate interest and enthusiasm among the students. We have developed and are currently teaching a course at New Mexico State University that meets many of these goals. The course concentrates on a simple but relatively complete microprocessor architecture, that of the Motorola 68HC11 processor. Three different teaching techniques are used to encourage experimentation and team work: learning sessions, simulator labs, and microprocessor labs. New concepts are introduced in learning sessions, which combine traditional lecturing with student exploration. The understanding of these new concepts is strengthened through labs and assignments. Simulator labs and assignments, which require interaction with a simulator of the Motorola 68HC11 microprocessor, focus on the 68HC11's instruction set architecture. Microprocessor labs and assignments, which essentially are designing and building sessions, focus on the use of a 68HC11 microprocessor to control a motorized vehicle. During microprocessor labs students populate printed circuit cards, build motorized vehicles (or other roboticized exotica), and design and implement assembly language programs that provide communication between a personal computer and a 68HC11 processor, and a 68HC11 processor and a motorized vehicle. We have found that the costs of running this course are minimal and the results are very favorable in terms of student enthusiasm and achievement.
{"title":"Mobile Robots Teach Machine-Level Programming","authors":"P. Teller, T. Dunning","doi":"10.1145/224170.224205","DOIUrl":"https://doi.org/10.1145/224170.224205","url":null,"abstract":"We feel strongly that a contemporary introductory course in machine organization and assembly language should focus on the essentials of how computers execute programs, and not be distracted by the complications of the extraordinarily sophisticated microprocessors that are available today. These essentials should form a strong base of knowledge from which students can draw as they continue their education in computer science. Ideally these goals should be attained in an environment that fosters experimentation and cooperation, and with the aid of projects that generate interest and enthusiasm among the students. We have developed and are currently teaching a course at New Mexico State University that meets many of these goals. The course concentrates on a simple but relatively complete microprocessor architecture, that of the Motorola 68HC11 processor. Three different teaching techniques are used to encourage experimentation and team work: learning sessions, simulator labs, and microprocessor labs. New concepts are introduced in learning sessions, which combine traditional lecturing with student exploration. The understanding of these new concepts is strengthened through labs and assignments. Simulator labs and assignments, which require interaction with a simulator of the Motorola 68HC11 microprocessor, focus on the 68HC11's instruction set architecture. Microprocessor labs and assignments, which essentially are designing and building sessions, focus on the use of a 68HC11 microprocessor to control a motorized vehicle. During microprocessor labs students populate printed circuit cards, build motorized vehicles (or other roboticized exotica), and design and implement assembly language programs that provide communication between a personal computer and a 68HC11 processor, and a 68HC11 processor and a motorized vehicle. We have found that the costs of running this course are minimal and the results are very favorable in terms of student enthusiasm and achievement.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123556845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An important challenge in the area of distributed computing is to automate the selection of the parameters that control the distributed computation. A performance-critical parameter is the grain size of the computation, i.e., the interval between successive synchronization points in the application. This parameter is hard to select since it depends both on compile time (loop structure and data dependences, computational complexity) and run time components (speed of compute nodes and network). On networks of workstations that are shared with other users, the run-time parameters can change over time. As a result, it is also necessary to consider the interactions with dynamic load balancing, which is needed to achieve good performance in this environment. In this paper we present a method for automatically selecting the grain size of the computation consisting of nested DO loops. The method is based on close cooperation between the compiler and the runtime system. We evaluate the method using both simulation and measurements for an implementation on the Nectar multicomputer.
{"title":"Controlling Application Grain Size on a Network of Workstations","authors":"B. Siegell, P. Steenkiste","doi":"10.1145/224170.224497","DOIUrl":"https://doi.org/10.1145/224170.224497","url":null,"abstract":"An important challenge in the area of distributed computing is to automate the selection of the parameters that control the distributed computation. A performance-critical parameter is the grain size of the computation, i.e., the interval between successive synchronization points in the application. This parameter is hard to select since it depends both on compile time (loop structure and data dependences, computational complexity) and run time components (speed of compute nodes and network). On networks of workstations that are shared with other users, the run-time parameters can change over time. As a result, it is also necessary to consider the interactions with dynamic load balancing, which is needed to achieve good performance in this environment. In this paper we present a method for automatically selecting the grain size of the computation consisting of nested DO loops. The method is based on close cooperation between the compiler and the runtime system. We evaluate the method using both simulation and measurements for an implementation on the Nectar multicomputer.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129624621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
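The dependence of grain size on both per-iteration cost and synchronization cost can be illustrated with a simple cost model. The model below is a hypothetical sketch, not the method of the paper: it assumes each grain costs a fixed synchronization overhead plus a constant time per loop iteration, and picks the smallest grain that keeps the overhead under a chosen percentage of the computation. All names and numbers are illustrative.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, exact for positive operands."""
    return -(-a // b)


def select_grain_size(iter_time_ns: int, sync_overhead_ns: int,
                      max_overhead_percent: int, total_iterations: int) -> int:
    """Smallest iteration count per grain that keeps sync overhead bounded.

    HYPOTHETICAL cost model: a grain of n iterations costs
    n * iter_time_ns of computation plus a fixed sync_overhead_ns, and we
    require sync_overhead_ns <= max_overhead_percent% of the computation.
    """
    if iter_time_ns <= 0 or not 0 < max_overhead_percent <= 100:
        raise ValueError("iter_time_ns must be > 0 and percent in (0, 100]")
    needed = ceil_div(100 * sync_overhead_ns,
                      max_overhead_percent * iter_time_ns)
    # A grain cannot be smaller than one iteration or larger than the loop.
    return max(1, min(needed, total_iterations))


# The same loop on a fast node and on a node slowed down by other users.
fast_node = select_grain_size(50_000, 2_000_000, 10, 100_000)    # 400
slow_node = select_grain_size(100_000, 2_000_000, 10, 100_000)   # 200
print(fast_node, slow_node)
```

In this model the per-iteration time would come from compile-time analysis combined with measured node speed, and the overhead from the measured network. When either changes at run time the selected grain changes with it, which is why the selection interacts with dynamic load balancing.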
We describe an architecture-adaptable methodology for the parallel implementation of finite element numerical models of physical systems. We use a model of time-dependent ocean currents as our working example. The heart of the computation is the solution of a banded linear system, and we describe an algorithm based on the domain decomposition method to solve the banded system. The algorithm is represented in a divide-and-conquer framework that facilitates easy implementation of various algorithmic options. The process is straightforward and amenable to automation. We demonstrate the validity of this approach using two radically different target machines, a workstation network and a supercomputer. Our results show very good speedup on both platforms.
{"title":"Architecture-Adaptable Finite Element Modelling: A Case Study Using an Ocean Circulation Simulation","authors":"S. Kumaran, Robert N. Miller, M. J. Quinn","doi":"10.1145/224170.224501","DOIUrl":"https://doi.org/10.1145/224170.224501","url":null,"abstract":"We describe an architecture-adaptable methodology for the parallel implementation of finite element numerical models of physical systems. We use a model of time-dependent ocean currents as our working example. The heart of the computation is the solution of a banded linear system, and we describe an algorithm based on the domain decomposition method to solve the banded system. The algorithm is represented in a divide-and-conquer framework that facilitates easy implementation of various algorithmic options. The process is straightforward and amenable to automation. We demonstrate the validity of this approach using two radically different target machines, a workstation network and a supercomputer. Our results show very good speedup on both platforms.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125500354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
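As a concrete picture of what solving a banded system involves, the sketch below solves the simplest case, a tridiagonal system, with the serial Thomas algorithm. This is a swapped-in textbook technique, not the domain decomposition algorithm of the paper; it only illustrates that the work grows linearly with the number of unknowns when the band is narrow. The example matrix and right-hand side are made up.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    No pivoting is done, so the matrix should be diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n   # modified upper diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: remove the lower diagonal row by row.
    for i in range(1, n):
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot
    # Back substitution from the last unknown upwards.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Made-up 4x4 example: diagonal 4, off-diagonals -1, solution [1, 2, 3, 4].
lower = [0.0, -1.0, -1.0, -1.0]
diag = [4.0, 4.0, 4.0, 4.0]
upper = [-1.0, -1.0, -1.0, 0.0]
rhs = [2.0, 4.0, 6.0, 13.0]
solution = solve_tridiagonal(lower, diag, upper, rhs)
print(solution)
```

Each unknown is touched a constant number of times, so the cost is proportional to the number of unknowns.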