
Proceedings of the IEEE/ACM SC95 Conference: Latest Publications

Interprocedural Compilation of Irregular Applications for Distributed Memory Machines
Pub Date: 1995-12-08 DOI: 10.1145/224170.224336
G. Agrawal, J. Saltz
Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture-independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications with irregular data access patterns written in such data parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of the runtime preprocessing routines and collective communication routines inserted for managing communication in such codes. We also present two new interprocedural optimizations: placement of scatter routines and the use of coalescing and incremental routines. We then describe how program slicing can be used to apply IPRE in more complex scenarios. We have done a preliminary implementation of the schemes presented here using the Fortran D compilation system as the necessary infrastructure. We present experimental results from two codes compiled with our system to demonstrate the efficacy of the presented schemes.
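The optimization target here is the inspector/executor pattern used for irregular codes: a runtime preprocessing ("inspector") routine turns an index array into a communication schedule, and IPRE tries to place that routine so it is not re-executed redundantly across procedure boundaries. The sketch below is a toy, sequential illustration of that idea, with the placement decision imitated by a runtime cache; all names are hypothetical and this is not the Fortran D runtime interface.

```python
# Toy illustration of the inspector/executor pattern whose redundant
# "inspector" (runtime preprocessing) calls IPRE tries to eliminate.
# All names here are hypothetical; this is not the Fortran D runtime API.

def build_schedule(index_array, block_size):
    """Inspector: for a block-distributed array, work out which owner holds
    each indirectly accessed element (a stand-in for building a
    communication schedule)."""
    return [(idx // block_size, idx % block_size) for idx in index_array]

def gather(data_blocks, schedule):
    """Executor: fetch the off-processor values named by the schedule."""
    return [data_blocks[owner][offset] for owner, offset in schedule]

# Without interprocedural placement, a subroutine called inside a time-step
# loop would rebuild the schedule on every call.  IPRE's effect is to hoist
# the inspector so it runs only when the index array actually changes; the
# cache below imitates that placement decision at runtime.
_schedule_cache = {}

def cached_schedule(index_array, block_size):
    key = (tuple(index_array), block_size)
    if key not in _schedule_cache:          # first (non-redundant) placement
        _schedule_cache[key] = build_schedule(index_array, block_size)
    return _schedule_cache[key]

if __name__ == "__main__":
    data_blocks = [[0, 1, 2, 3], [4, 5, 6, 7]]   # two "processors", block size 4
    ia = [5, 0, 6, 3]                            # irregular access pattern
    for timestep in range(3):                    # inspector built once, reused twice
        sched = cached_schedule(ia, block_size=4)
        print(timestep, gather(data_blocks, sched))
```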
Citations: 37
Parallel Algorithms for Forward and Back Substitution in Direct Solution of Sparse Linear Systems
Pub Date: 1995-12-08 DOI: 10.1145/224170.224471
Anshul Gupta, Vipin Kumar
A few parallel algorithms for solving triangular systems resulting from parallel factorization of sparse linear systems have been proposed and implemented recently. We present a detailed analysis of parallel complexity and scalability of the best of these algorithms and the results of its implementation on up to 256 processors of the Cray T3D parallel computer. It has been a common belief that parallel sparse triangular solvers are quite unscalable due to a high communication to computation ratio. Our analysis and experiments show that, although not as scalable as the best parallel sparse Cholesky factorization algorithms, parallel sparse triangular solvers can yield reasonable speedups in runtime on hundreds of processors. We also show that for a wide class of problems, the sparse triangular solvers described in this paper are optimal and are asymptotically as scalable as a dense triangular solver.
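For readers unfamiliar with the kernel being parallelized, the sketch below shows the sequential forward-substitution step for a sparse lower-triangular matrix stored in compressed sparse row (CSR) form. The storage layout, and the assumption that the diagonal is the last entry of each row, are illustrative choices; none of the paper's parallel machinery is shown.

```python
# A minimal, sequential sketch of the forward-substitution step the paper
# parallelizes: solve L x = b where L is sparse lower triangular, stored in
# compressed sparse row (CSR) form.

def forward_substitution_csr(indptr, indices, values, b):
    """Solve L x = b for lower-triangular L in CSR format.
    The last entry of each row is assumed to be the diagonal element."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        for k in range(indptr[i], indptr[i + 1] - 1):   # off-diagonal entries
            s -= values[k] * x[indices[k]]
        diag = values[indptr[i + 1] - 1]
        x[i] = s / diag
    return x

if __name__ == "__main__":
    # L = [[2, 0, 0],
    #      [1, 3, 0],
    #      [0, 4, 5]]
    indptr  = [0, 1, 3, 5]
    indices = [0, 0, 1, 1, 2]
    values  = [2.0, 1.0, 3.0, 4.0, 5.0]
    print(forward_substitution_csr(indptr, indices, values, [2.0, 5.0, 13.0]))
```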
Citations: 20
Relative Debugging and its Application to the Development of Large Numerical Models
Pub Date: 1995-12-08 DOI: 10.1145/224170.224350
D. Abramson, Ian T Foster, J. Michalakes, R. Sosič
Because large scientific codes are rarely static objects, developers are often faced with the tedious task of accounting for discrepancies between new and old versions. In this paper, we describe a new technique called relative debugging that addresses this problem by automating the process of comparing a modified code against a correct reference code. We examine the utility of the relative debugging technique by applying a relative debugger called Guard to a range of debugging problems in a large atmospheric circulation model. Our experience confirms the effectiveness of the approach. Using Guard, we are able to validate a new sequential version of the atmospheric model, and to identify the source of a significant discrepancy in a parallel version in a short period of time.
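A minimal sketch of the comparison step at the heart of relative debugging appears below: at a user-chosen point, an array from the modified code is checked against the corresponding array from the reference code and any divergences are reported. The assertion function is hypothetical and is not Guard's actual interface.

```python
# A sketch of the core idea behind relative debugging: at user-chosen points,
# compare state from a modified code against the same state in a reference
# code and report where they diverge.  The function name and signature are
# hypothetical, not Guard's interface.

def assert_equal_arrays(name, reference, suspect, tolerance=1e-9):
    """Report every index at which the two versions of an array differ by
    more than the tolerance, in the spirit of a relative-debugging assertion."""
    diffs = [(i, r, s) for i, (r, s) in enumerate(zip(reference, suspect))
             if abs(r - s) > tolerance]
    if not diffs:
        print(f"{name}: versions agree at all {len(reference)} points")
    for i, r, s in diffs:
        print(f"{name}[{i}]: reference={r!r} modified={s!r}")

if __name__ == "__main__":
    temp_v1 = [290.0, 291.5, 293.0, 294.5]   # trusted reference version
    temp_v2 = [290.0, 291.5, 293.2, 294.5]   # new version with a discrepancy
    assert_equal_arrays("temperature", temp_v1, temp_v2)
```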
Citations: 34
Quantum Chromodynamics Simulation on NWT
Pub Date: 1995-12-08 DOI: 10.1145/224170.224403
M. Yoshida, A. Nakamura, M. Fukuda, Takashi Nakamura, S. Hioki
A portable QCD simulation program running on NWT with 128 PEs achieves a performance of 0.032 microseconds per link update, about 189 times faster than a highly optimized code on a four-processor CRAY X-MP/48. This corresponds to a sustained speed of 178 GFLOPS.
Citations: 4
High-Performance Incremental Scheduling on Massively Parallel Computers — A Global Approach
Pub Date: 1995-12-08 DOI: 10.1145/224170.224358
Minyou Wu, W. Shu
Runtime incremental parallel scheduling (RIPS) is a new approach to load balancing. In parallel scheduling, all processors cooperate to balance the workload, and the load is balanced accurately using global load information. In incremental scheduling, the system's scheduling activity alternates with the underlying computation work. RIPS produces high-quality load balancing and adapts to applications with nonuniform structures. This paper presents methods for scheduling a single job on a dedicated parallel machine.
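The sketch below illustrates, sequentially, what one global balancing step looks like when every processor's load is considered at once, as opposed to purely local, neighbour-to-neighbour exchanges. The unit-cost task representation and the within-one-task-of-average target are illustrative assumptions, not the RIPS algorithm itself.

```python
# A sequential sketch of one global, cooperative balancing step of the kind
# RIPS interleaves with computation: every processor's load is taken into
# account at once, rather than exchanging work only with neighbours.
# The task representation and thresholds are illustrative assumptions.

def global_balance(queues):
    """Move unit-cost tasks so every queue ends up within one task of the
    global average.  `queues` is a list of per-processor task lists."""
    total = sum(len(q) for q in queues)
    target = total // len(queues)
    surplus = []                               # tasks taken from overloaded processors
    for q in queues:
        while len(q) > target + 1:
            surplus.append(q.pop())
    for q in queues:                           # hand them to underloaded processors
        while len(q) < target and surplus:
            q.append(surplus.pop())
    return queues

if __name__ == "__main__":
    queues = [list(range(9)), [], list(range(3)), [0]]
    print([len(q) for q in global_balance(queues)])   # roughly equal lengths
```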
Citations: 4
Lazy Release Consistency for Hardware-Coherent Multiprocessors
Pub Date: 1995-12-08 DOI: 10.1145/224170.224398
L. Kontothanassis, M. Scott, R. Bianchini
Release consistency is a widely accepted memory model for distributed shared memory systems. Eager release consistency represents the state of the art in release consistent protocols for hardware-coherent multiprocessors, while lazy release consistency has been shown to provide better performance for software distributed shared memory (DSM). Several of the optimizations performed by lazy protocols have the potential to improve the performance of hardware-coherent multiprocessors as well, but their complexity has precluded a hardware implementation. With the advent of programmable protocol processors it may become possible to use them after all. We present and evaluate a lazy release-consistent protocol suitable for machines with dedicated protocol processors. This protocol admits multiple concurrent writers, sends write notices concurrently with computation, and delays invalidations until acquire operations. We also consider a lazier protocol that delays sending write notices until release operations. Our results indicate that the first protocol outperforms eager release consistency by as much as 20% across a variety of applications. The lazier protocol, on the other hand, is unable to recoup its high synchronization overhead. This represents a qualitative shift from the DSM world, where lazier protocols always yield performance improvements. Based on our results, we conclude that machines with flexible hardware support for coherence should use protocols based on lazy release consistency, but in a less ''aggressively lazy'' form than is appropriate for DSM.
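The following toy model illustrates only the central timing idea of the protocol: write notices are recorded while computation proceeds, but the matching invalidations are applied only when a reader performs an acquire. Cache-line granularity, the directory, and the protocol processor are all abstracted away; the class and its methods are purely illustrative.

```python
# A toy, software-level model of the protocol's central idea: write notices
# are recorded during computation, but the corresponding invalidations are
# deferred until a reader's acquire operation.  This is an illustration only,
# not the protocol described in the paper.

class LazyReleaseLock:
    def __init__(self):
        self.pending_notices = set()       # lines written since the last acquire

    def note_write(self, line):
        """Write notice recorded while computation continues."""
        self.pending_notices.add(line)

    def acquire(self, cache):
        """Invalidations are deferred until here, at the acquire."""
        for line in self.pending_notices:
            cache.pop(line, None)
        self.pending_notices.clear()

if __name__ == "__main__":
    lock = LazyReleaseLock()
    reader_cache = {"x": 1, "y": 2}        # reader's cached copies
    lock.note_write("x")                   # writer updates x inside its critical section
    print(reader_cache)                    # stale copy still visible: {'x': 1, 'y': 2}
    lock.acquire(reader_cache)             # invalidation applied only at acquire
    print(reader_cache)                    # {'y': 2}; reader must refetch x
```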
Citations: 34
MONSTER — The Ghost in the Connection Machine: Modularity of Neural Systems in Theoretical Evolutionary Research
Pub Date: 1995-12-08 DOI: 10.1145/224170.224226
Nigel Snoad, T. Bossomaier
Both genetic algorithms (GAs) and artificial neural networks (ANNs) (connectionist learning models) are effective generalisations of successful biological techniques to the artificial realm. Both techniques are inherently parallel and seem ideal for implementation on the current generation of parallel supercomputers. We consider how the two techniques complement each other and how combining them (i.e. evolving artificial neural networks with a genetic algorithm), may give insights into the evolution of structure and modularity in biological brains. The incorporation of evolutionary and modularity concepts into artificial systems has the potential to decrease the development time of ANNs for specific image and information processing applications. General considerations when genetically encoding ANNs are discussed, and a new encoding method developed, which has the potential to simplify the generation of complex modular networks. The implementation of this technique on a CM-5 parallel supercomputer raises many practical and theoretical questions in the application and use of evolutionary models with artificial neural networks.
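As a concrete, deliberately minimal example of evolving a neural network with a genetic algorithm, the sketch below uses the simplest possible genome, a flat vector of weights, to train a tiny feed-forward network on XOR. The modular encoding that is the paper's contribution is not represented here; the population size, mutation rate, and network shape are arbitrary illustrative choices.

```python
# A minimal sketch of the combination the paper studies: a genetic algorithm
# evolving the weights of a small feed-forward network (here, to learn XOR).
# The direct weight encoding used below is the simplest possible genome; the
# paper's richer, modular encoding is not shown.

import math, random

random.seed(0)
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(w, x):
    """2-2-1 network; w is a flat genome of 9 weights (including biases)."""
    h1 = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = math.tanh(w[3] * x[0] + w[4] * x[1] + w[5])
    return math.tanh(w[6] * h1 + w[7] * h2 + w[8])

def fitness(w):
    return -sum((forward(w, x) - y) ** 2 for x, y in XOR)   # higher is better

def evolve(pop_size=60, generations=200, mutation=0.4):
    pop = [[random.uniform(-2, 2) for _ in range(9)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 4]                       # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, 9)                     # one-point crossover
            child = [g + random.gauss(0, mutation) for g in a[:cut] + b[cut:]]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

if __name__ == "__main__":
    best = evolve()
    for x, y in XOR:
        print(x, y, round(forward(best, x), 2))
```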
Citations: 8
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
Pub Date: 1995-12-08 DOI: 10.1145/224170.224410
D. Jayasimha, M. Hayder, S. K. Pillay
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time-accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies (the IBM SP and the Cray T3D). We investigate the impact of the various networks connecting the cluster of workstations on the performance of the application, as well as the overheads induced by the popular message passing libraries used for parallelization. The work also highlights the importance of matching memory bandwidth to processor speed for good single-processor performance. By studying the performance of one application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Citations: 2
Parallel Matrix-Vector Product Using Approximate Hierarchical Methods
Pub Date: 1995-12-08 DOI: 10.1145/224170.224487
A. Grama, Vipin Kumar, A. Sameh
Matrix-vector products (mat-vecs) form the core of iterative methods used for solving dense linear systems. Often, these systems arise in the solution of integral equations used in electromagnetics, heat transfer, and wave propagation. In this paper, we present a parallel approximate method for computing mat-vecs used in the solution of integral equations. We use this method to compute dense mat-vecs of hundreds of thousands of elements. The combined speedups obtained from the use of approximate methods and parallel processing represent an improvement of several orders of magnitude over exact mat-vecs on uniprocessors. We demonstrate that our parallel formulation incurs minimal parallel processing overhead and scales up to a large number of processors. We study the impact of varying the accuracy of the approximate mat-vec on overall time and on parallel efficiency. Experimental results are presented for 256 processor Cray T3D and Thinking Machines CM5 parallel computers. We have achieved computation rates in excess of 5 GFLOPS on the T3D.
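The sketch below shows, in one dimension and with a single level of clustering, the idea behind such approximate matrix-vector products: the matrix entries kernel(x_i, x_j) are never formed, well-separated source clusters are replaced by a single aggregate term at the cluster center, and only nearby clusters are evaluated exactly. The kernel, the acceptance parameter theta, and the cluster size are illustrative assumptions; the paper's method uses a full hierarchy and a parallel formulation.

```python
# A sketch of the hierarchical (tree-code style) idea behind the approximate
# matrix-vector product, with one level of clustering in one dimension.
# The kernel and parameters are illustrative, not the paper's method.

def kernel(xi, xj):
    return 1.0 / (1.0 + abs(xi - xj))

def approx_matvec(points, charges, cluster_size=8, theta=3.0):
    """Return y with y[i] ~= sum_j kernel(points[i], points[j]) * charges[j]."""
    clusters = []
    for start in range(0, len(points), cluster_size):
        idx = list(range(start, min(start + cluster_size, len(points))))
        center = sum(points[j] for j in idx) / len(idx)
        radius = max(abs(points[j] - center) for j in idx)
        q = sum(charges[j] for j in idx)
        clusters.append((idx, center, radius, q))
    y = []
    for xi in points:
        total = 0.0
        for idx, center, radius, q in clusters:
            if abs(xi - center) > theta * radius:    # well separated: one term per cluster
                total += kernel(xi, center) * q
            else:                                    # nearby: exact interaction
                total += sum(kernel(xi, points[j]) * charges[j] for j in idx)
        y.append(total)
    return y

if __name__ == "__main__":
    pts = [0.5 * i for i in range(64)]
    q = [1.0] * 64
    approx = approx_matvec(pts, q)
    exact = [sum(kernel(xi, xj) for xj in pts) for xi in pts]
    rel_err = max(abs(a - e) / e for a, e in zip(approx, exact))
    print(f"max relative error of the approximation: {rel_err:.3%}")
```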
Citations: 10
Index Array Flattening Through Program Transformation
Pub Date: 1995-12-08 DOI: 10.1145/224170.224420
R. Das, P. Havlak, J. Saltz, K. Kennedy
This paper presents techniques for compiling loops with complex, indirect array accesses into loops whose array references have at most one level of indirection. The transformation allows prefetching of array indices for more efficient structuring of communication on distributed-memory machines. It can also improve performance on other architectures by enabling prefetching of data between levels of the memory hierarchy or exploitation of hardware support for vectorized gather/scatter. Our techniques are implemented in a compiler for Fortran D and execution speed improvements are given for multiprocessor and vector machines.
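A small sketch of the flattening transformation itself is given below: a loop whose array reference has two levels of indirection, x[ia[ib[i]]], is split into a slice that gathers the indices into a single flattened index array and a loop with at most one level of indirection. The array names are illustrative; the paper performs this transformation on Fortran D programs at compile time, where the precomputed index array is what enables index prefetching and communication scheduling.

```python
# A sketch of the flattening transformation: a loop with two levels of
# indirection, x[ia[ib[i]]], becomes a slice that precomputes a flattened
# index array plus a loop with only one level of indirection.
# Array names are illustrative.

def original_loop(x, ia, ib):
    # two levels of indirection inside the computational loop
    return [2.0 * x[ia[ib[i]]] for i in range(len(ib))]

def flattened_loop(x, ia, ib):
    # "slice" that gathers the indices once; this is what allows index
    # prefetching and communication scheduling on distributed-memory machines
    flat = [ia[ib[i]] for i in range(len(ib))]
    # computational loop now has at most one level of indirection
    return [2.0 * x[flat[i]] for i in range(len(ib))]

if __name__ == "__main__":
    x = [10.0, 20.0, 30.0, 40.0]
    ia = [3, 2, 0, 1]
    ib = [1, 1, 3, 0, 2]
    assert original_loop(x, ia, ib) == flattened_loop(x, ia, ib)
    print(flattened_loop(x, ia, ib))
```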
Citations: 37