
[1993] Proceedings Seventh International Parallel Processing Symposium: Latest Papers

A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262814
B. Kumar, Chua-Huang Huang, Rodney W. Johnson, P. Sadayappan
A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen's matrix multiplication algorithm requires working arrays of size O(7^n) for multiplying 2^n × 2^n matrices. The authors present a modified tensor product formulation of Strassen's algorithm in which the size of working arrays can be reduced to O(4^n). The modified formulation exhibits sufficient parallel and vector operations for efficient implementation. Performance results on the Cray Y-MP are presented.
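As a rough illustration of the recursion behind these sizes, the sketch below implements the textbook recursive form of Strassen's algorithm in plain Python. It is not the tensor product formulation described in the abstract and makes no attempt at the memory reduction; the function names (`strassen`, `split`, `add`, `sub`) are chosen here for illustration only. Each level replaces one block product by seven, so multiplying 2^n × 2^n matrices performs 7^n scalar multiplications at the leaves, which is presumably related to the O(7^n) working-array size quoted for the earlier formulation.

```python
def add(A, B):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(M):
    """Split a 2^n x 2^n matrix into its four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B):
    """Multiply two 2^n x 2^n matrices with seven recursive block products.

    Assumption: both inputs are square with a power-of-two dimension.
    """
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # The seven Strassen products (instead of the eight of the naive method).
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine the products into the four quadrants of the result.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

For example, `strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])` gives `[[19, 22], [43, 50]]`, the same result as the ordinary row-by-column product.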
Citations: 70
Automatic parallelization of LINPACK routines on distributed memory parallel processors
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262774
M. Neeracher, R. Rühl
Distributed memory parallel processors (DMPPs) have no hardware support for a global address space. However, conventional programs written in a sequential imperative language such as Fortran typically manipulate few, large arrays. The Oxygen compiler, developed as part of the K2 project, accepts conventional Fortran code, augmented with code and data distribution directives. These directives support a global name space through a run-time mechanism called data consistency analysis. Many sequential Fortran programs can be efficiently parallelized, with Oxygen directives introduced manually by the user into the sequential code. This work presents an analysis pass added to the compiler that makes suggestions for the directives to be inserted into the code. Automatic parallelization of LINPACK routines was attempted and results are given.
Citations: 11
Scheduling in and out forests in the presence of communication delays
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262886
T. Varvarigou, V. Roychowdhury, T. Kailath, E. Lawler
The authors consider the problem of scheduling tasks on multiprocessor architectures in the presence of communication delays. Given a set of dependent tasks, the scheduling problem is to allocate the tasks to processors such that the pre-specified precedence constraints among the tasks are obeyed and certain cost measures (such as computation time) are minimized. Several cases of the scheduling problem have been proven to be NP-complete. Nevertheless, there are polynomial time algorithms for several interesting special cases of the general scheduling problem. Most of these results, however, do not take into consideration the delays due to message passing among processors. The authors study the increase in time complexity of the scheduling problem due to the introduction of communication delays. In particular, they address the open problem of scheduling out-forests (in-forests) in a multiprocessor system of m identical processors when communication delays are considered. They present the first known polynomial time algorithms for the computation of the optimal schedule when the number of available processors is given and bounded and both computation and communication delays are assumed to take one unit of time.
Citations: 69
Explicit parallel structuring for rule-based programming
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262829
Shiow-yang Wu, J. Browne
This paper presents semantically-based explicit parallel structuring for rule-based programming systems. Explicit parallel structuring appears to be necessary since compile-time dependency analysis of sequential programs has not yielded large-scale parallelism and run-time analysis for parallelism is restricted by the execution cost of the analysis. Simple language extensions specifying semantics of rules are used to define parallel execution behavior at the rule level. Type definitions for working memory elements are extended to include relationships within and among objects which define the parallelism allowed on instances of object types. The first result presented is that the algorithms implemented by commonly used benchmark rule-based programs contain scalable parallelism. The second result is that much of that parallelism can be captured by simple and modest extensions of rule-based languages which are analogues of models and constructs used for specification of parallel structures in imperative programming languages. A sketch is given for a comprehensive language system which exploits specification of semantics defining parallel structures in both object-definition and executable segments of rule-based programs.
Citations: 8
A portable parallel algorithm for VLSI circuit extraction
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262922
B. Ramkumar, P. Banerjee
The authors describe a new portable algorithm for parallel circuit extraction. The algorithm is built as part of the ongoing ProperCAD project: a portable object-oriented parallel environment for CAD applications that is built on top of the CHARM system. The algorithm, unlike prior approaches such as PACE, is asynchronous and is based on a coarse-grained dataflow execution model. Performance of circuit extraction is presented on four parallel machines: an Encore Multimax, a Sequent Symmetry, an NCUBE 2 hypercube, and a network of Sun Sparc workstations. The extractor runs unchanged on all these machines.
Citations: 13
Why BSP computers? (bulk-synchronous parallel computers)
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262847
L. Valiant
The author summarizes some of the arguments favoring the adoption of the bulk-synchronous parallel (BSP) model as a standard for parallel computing. First, he argues that for parallel computing to become a major industry, agreement has to be reached on a standard model at a level intermediate between the language and architecture levels. He goes on to list the factors that make the BSP model attractive as a standard at this intermediate or bridging level. Finally, he provides some reasons for favoring it over the shared memory or PRAM model, which is an alternative candidate for this role.
Citations: 15
Mapping onto three classes of parallel machines: a case study using the cyclic reduction algorithm
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262888
G. Saghi, H. Siegel, J. L. Gray
Mapping cyclic reduction, a known approach for the parallel solution of tridiagonal systems of equations, onto the MasPar MP-1, nCUBE 2, and PASM parallel machines is discussed. Each of these represents a different mode of parallelism. Issues addressed are SIMD/MIMD trade-offs, the effect on execution time of increasing the number of processors used, the impact of the inter-processor communications network on performance, the importance of predicting algorithm performance as a function of the mapping used, and the advantages of a partitionable system. Analytical results are validated by experimentation on all three machines.
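For readers unfamiliar with cyclic reduction, the following is a minimal sequential sketch in Python, assuming a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] with a[0] = 0, c[n-1] = 0 and no need for pivoting (for example, a diagonally dominant matrix). The function name and argument layout are illustrative assumptions, and the code does not reproduce the machine mappings studied in the paper. What it does show is the structure that makes the method attractive for parallel machines: within one reduction level, every odd-indexed equation is updated independently of the others, and the same holds for the even-indexed unknowns during back substitution.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive cyclic reduction.

    a, b, c are the sub-, main and super-diagonals, d the right-hand side.
    Assumptions: a[0] == 0, c[-1] == 0, and no zero pivots arise.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: eliminate the even-indexed unknowns from each odd-indexed
    # equation. Each loop iteration is independent of the others.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        ai = -alpha * a[i - 1]
        bi = b[i] - alpha * c[i - 1]
        ci = 0.0
        di = d[i] - alpha * d[i - 1]
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            bi -= gamma * a[i + 1]
            ci = -gamma * c[i + 1]
            di -= gamma * d[i + 1]
        ra.append(ai)
        rb.append(bi)
        rc.append(ci)
        rd.append(di)

    # The odd-indexed unknowns form a tridiagonal system of half the size.
    odd = cyclic_reduction(ra, rb, rc, rd)

    x = [0.0] * n
    for k, i in enumerate(range(1, n, 2)):
        x[i] = odd[k]

    # Back substitution: each even-indexed unknown depends only on its
    # already known odd-indexed neighbours, so these are independent too.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x
```

With n unknowns the recursion has about log2(n) levels, so a machine with enough processors could in principle finish each level in constant time plus communication, which is where the trade-offs listed in the abstract come in.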
Citations: 0
Load balancing of DOALL loops in the Perfect Club
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262868
G. Elsesser, Viet N. Ngo, S. Bhattacharya, W. Tsai
The speedup achieved by concurrent execution of loop iterations is determined by load balance and several other factors, so no single strategy provides maximum speedup for all classes of programs and all target architectures. Hence, the selection of a load balancing strategy must be guided by characteristics of both the application domain and the target machine architecture. The authors study loop load balance in the context of the well-known Perfect Club benchmark. Several static and dynamic characteristics of DOALL loops are observed and interpreted. Late arrival of processors is identified as a significant source of load imbalance. A scheme for processor preallocation is proposed and the advantages and applicability of this scheme are demonstrated by analytical estimates as well as experimental evaluation on a Cray YMP-8.
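To make the effect of late-arriving processors concrete, here is a small toy model in Python. It is a hypothetical illustration, not the authors' preallocation scheme or their measurement method: it assumes every iteration costs the same fixed time, ignores scheduling overhead, and uses invented function names. `static_makespan` gives each processor an equal block of iterations up front, while `self_scheduled_makespan` hands the next iteration to whichever processor becomes free first.

```python
import heapq


def static_makespan(n_iters, arrivals, cost=1.0):
    """Finish time when iterations are split into equal static blocks.

    arrivals[k] is the (assumed) time at which processor k starts working.
    """
    p = len(arrivals)
    base, extra = divmod(n_iters, p)
    return max(t + (base + (1 if k < extra else 0)) * cost
               for k, t in enumerate(arrivals))


def self_scheduled_makespan(n_iters, arrivals, cost=1.0):
    """Finish time when each iteration goes to the earliest free processor."""
    ready = list(arrivals)          # time at which each processor is next free
    heapq.heapify(ready)
    finish = 0.0
    for _ in range(n_iters):
        t = heapq.heappop(ready)    # earliest available processor
        finish = max(finish, t + cost)
        heapq.heappush(ready, t + cost)
    return finish
```

With 8 unit-cost iterations on four processors, one of which arrives 6 time units late (`arrivals=[0, 0, 0, 6]`), the static split finishes at time 8 because the late processor still owns two iterations, whereas the self-scheduled loop finishes at time 3. When all processors arrive together, both strategies finish at time 2 in this model.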
Citations: 1
A multi-level hierarchical cache coherence protocol for multiprocessors
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262871
Craig Anderson, J. Baer
To meet the computational needs of the next decade, shared-memory processors must be scalable. Though single shared-bus architectures have been successful in the past, lack of bus bandwidth restricts the number of processors that can be effectively put on a single-bus machine. One architecture that has been proposed to solve the limited bandwidth problem consists of processors connected via a tree hierarchy of buses. The authors present a tool to study a hierarchical bus-based shared-memory system. They highlight the main features of a hierarchical cache coherence protocol and give some preliminary performance results obtained via an instruction-level simulator.
Citations: 9
The data-parallel Ada run-time system, simulation and empirical results
Pub Date : 1993-04-13 DOI: 10.1109/IPPS.1993.262808
H. G. Mayer, Stefan Jähnichen
The Parallel Ada Run-Time System (PARTS), developed at TUB, is the target of an experimental translator that maps sequential Ada to a shared-memory multi-processor. Other modules of the parallel compiler are not explained. The paper summarizes the multi-processor run-time system; it explains those instructions that activate multiple processors leading to SPMD execution and discusses the scheduling policy. Default architectural attributes of PARTS can be custom-tailored for each run without recompilation. The experiments exposed different machine personalities by measuring execution time profiles of the vector product run on different architectures. The goal is to find experimentally how well a shared-memory architecture scales up to an increasing problem size, and how well the problem size scales up for a fixed multi-processor configuration. The measurements expose the advantages of shared-memory multi-processor architectures in exploiting one dimension of parallelism. However, scalability is limited by the number of memory ports. Therefore another architectural dimension of parallelism, distributed memory, must be combined with shared memories to achieve Tera-FLOP performance.
Citations: 6