
The Sixth Distributed Memory Computing Conference, 1991. Proceedings: Latest Publications

Optimal Total Exchange on an SIMD Distributed-Memory Hypercube
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633143
D. Delesalle, D. Trystram, D. Wenzek
This paper deals with optimality results on the implementation of fundamental communication schemes on a distributed-memory SIMD hypercube multiprocessor (namely, global exchange and personalized global exchange with accumulation). Some experiments are given on a Connection Machine.
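For orientation, here is a hedged sketch of the simpler of the two schemes, a global exchange (all-to-all broadcast) by recursive doubling across the cube dimensions. `csend`/`crecv` are assumed message-passing primitives (iPSC-style names, not functions from the paper), and the paper's optimal SIMD variant additionally drives all links of every node concurrently rather than one link per step.

```c
/* Hedged sketch (not the paper's algorithm): global exchange by recursive
 * doubling on a d-dimensional hypercube. buf holds 2^d blocks of blk bytes;
 * block 'me' is this node's contribution. After step k each node holds the
 * 2^(k+1) contiguous blocks belonging to its dimension-(k+1) subcube. */
void csend(int dest, const void *buf, int bytes);  /* assumed primitive */
void crecv(int src, void *buf, int bytes);         /* assumed primitive */

void global_exchange(int me, int d, char *buf, int blk) {
    for (int k = 0; k < d; k++) {
        int partner = me ^ (1 << k);              /* neighbour across dim k */
        int mine  = ((me      >> k) << k) * blk;  /* my subcube's region    */
        int yours = ((partner >> k) << k) * blk;  /* partner's region       */
        int len   = (1 << k) * blk;
        csend(partner, buf + mine,  len);
        crecv(partner, buf + yours, len);
    }
}
```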
{"title":"Optimal Total Exchange on an SIMD Distributed-Memory Hypercube","authors":"D. Delesalle, D. Trystram, D. Wenzek","doi":"10.1109/DMCC.1991.633143","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633143","url":null,"abstract":"This paper deals with optimality results on the implementation of fundamental communication schemes on a distributed-memory SIMD hypercubemultiprocessor (namely, global exchange and personalized global exchange with accumulation). Some experiments are given on a Connection Machine.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114093378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
An Implementation of the Radix Sorting Algorithm on the Touchstone Delta Prototype
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633213
Marc Baber
This implementation of the radix sorting algorithm considers the nodes of the multicomputer to be buckets for receiving keys that correspond with their node identifiers. Sorting a list of 30-bit keys requires six passes on a 32-node hypercube, because five bits are considered in each pass. When the number of buckets is equal to the number of processors, superlinear speedups are obtained because, in addition to assigning smaller subsets of the data to each node, the number of passes required decreases when more bits are considered in each pass. True speedups close to linear are observed when the number of buckets is made independent of the number of processors by permitting multiple buckets per processor, so that a small hypercube can emulate a larger hypercube's ability to consider more bits during each pass through the data. Experiments on an iPSC/860 and the Touchstone Delta prototype system show that the algorithm is well suited to multicomputer architectures and that it scales well for random distributions of keys.

Introduction. The radix sorting algorithm has time complexity m·O(n) for n keys, each m bits in length. This time complexity compares favorably with most of the popular O(n log n) algorithms, and so radix is often the method of choice. In the context of a parallel machine this continues to be true, as long as the distribution of keys is nearly flat. On a multicomputer, the overhead associated with the straight radix sort [6] is that it requires more than one all-to-all message exchange. The number of exchanges can be up to the number of bits in a single key on a two-node system with a single bucket per node. (Supported in part by: Defense Advanced Research Projects Agency, Information Science and Technology Office, Research in Concurrent Computing Systems, ARPA Order No. 6402, 6402-1; Program Code No. 8E20 & 9E20, issued under Contract #MDA-972-89-C-0034.) On the Touchstone Delta prototype system, using 512 (2^9) processing nodes, this implementation of the straight radix sort processes 9 bits in each pass through the data, so a 32-bit integer is fully sorted in four passes and only four all-to-all message exchanges are required. The radix algorithm is sensitive to uneven distributions of keys. If the bit patterns of the keys deviate too far from a random, even distribution, then some node(s) will require disproportionate amounts of memory. Most distributions, in practice, are more random in the low-order bits than the high-order bits. Therefore, this implementation uses the straight radix sort [6], or least significant digit [4], variation of the radix algorithm in order to postpone any load imbalances until the last pass through the data. A radix exchange sort, or most significant digit, implementation of the radix algorithm would require only one all-to-all message exchange, followed by a local sort on each node, but the method could be more prone to performance degradation due to load imbalance.

Related Work. The problem of sorting on hypercube architectures has been the subject of several papers in the last few years. Felten et al. [2,3] designed a distributed version of quicksort, sometimes called "hyperquicksort" [9], which uses global splitting points to partition the keys into successively smaller subcubes until each key range is stored on a single node. Because each node stores a distinct range of keys, no global merge is needed; the sort is complete once each node applies a local quicksort (or other sequential sort) to its data. Seidel and George [7] studied three different binsort algorithms. Each first assigns a sub-range of keys to every node (based on an assumed uniform distribution or on the distribution observed in a sample of the data), then decomposes each node's data into subsets destined for every other node. All messages are then sent simultaneously from initial source to final destination, using all dimensions of the hypercube in a single step, after which each node applies quicksort to its local sub-range of keys. Li and Tung [5] compared the performance of three different sorting algorithms on the Symult 2010 and found that a parallel quicksort outperformed bitonic sort and shellsort for larger problem sizes (greater than 64p, where p is the number of processor nodes). Abali et al. [1] developed a load-balanced variant of distributed quicksort, similar to Seidel and George [7], except that each node quicksorts its own data before the sub-ranges are assigned to the nodes; this allows the nodes to determine the exact keys that divide the data most evenly. After each node receives sorted packets of the keys in its sub-range from the other nodes, an n-way merge is performed on each node. Tang [8] implemented a sorting algorithm based on a local quicksort on each node followed by a global shell merge.
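The pass counts quoted above follow directly from the per-pass digit width; a minimal sketch of just that arithmetic (not the paper's code, which also performs the all-to-all exchanges):

```c
/* With 2^b buckets (one per node, or several per node), an LSD radix sort
 * of m-bit keys needs ceil(m/b) passes, each costing one all-to-all. */
#include <stdio.h>

static int radix_passes(int key_bits, int bits_per_pass) {
    return (key_bits + bits_per_pass - 1) / bits_per_pass;   /* ceil(m/b) */
}

int main(void) {
    printf("%d\n", radix_passes(30, 5)); /* 32 nodes, 5 bits/pass -> 6 passes */
    printf("%d\n", radix_passes(32, 9)); /* 512 nodes, 9 bits/pass -> 4 passes */
    return 0;
}
```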
{"title":"An Implementation of the Radix Sorting Algorithm on the Touchstone Delta Prototype","authors":"Marc Baber","doi":"10.1109/DMCC.1991.633213","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633213","url":null,"abstract":"This implementation of the radix sorting algorithm considers the nodes of the multicomputer to be buckets for receiving keys that correspond with their node identijiers. Sorting a list of 30-bit keys requires six passes on a 32-node hypercube, because five bits are considered in each pass. When the number of buckets is equal to the number of processors, superlinear speedups are obtained because, in addition to assigning smaller subsets of the data to each node, the number of passes required decreases when more bits are considered in each pass. True speed ups close to linear are observed when the number of buckets is made independent of the number of processors by permitting multiple buckets per processor so that a small hypercube can emulate a larger hypercube’s ability to consider more bits during each pass through the daa. Experiments on an iPSCl860 and the Touchstone Delta Prototype system show that the algorithm is well suited to multicomputer architectures and that i t scales well for random distributions of keys. Introduction The radix sorting algorithm has a time complexity mO(n) for n keys, each m bits in length. This time complexity compares favorably with most of the popular O(n log n) algorithms and so, radix is often the method of choice. In the context of a parallel machine, this continues to be true, as long as the distribution of keys is nearly flat. On a multicomputer, the overhead associated with the straight radix sort [6] is that it requires more than one allto-all message exchange. The number of exchanges can be up to the number of bits in a single key on a two-node * Supported in part by: Defense Advanced Research Projects Agency Information Science and Technology Office Research in Concurrent Computing Systems ARPA Order No. 6402.6402-1; Program Code No. 8E20 & 9E20 Issued by DARPNCMO under Contract #MDA-972-89-C-0034 system with a single bucket per node. On the Touchstone Delta prototype system, using 5 12 or 29 processing nodes, this implementation of the straight radix sort processes 9 bits in each pass through the data, so a 32-bit integer is fully sorted in four passes and only four all-to-all message exchanges are required. The radix algorithm is sensitive to uneven distributions of keys. If the bit patterns of the keys deviate too far from a random, even distribution, then some node(s) will require disproportionate amounts of memory. Most distributions, in practice, are more random in the low order bits than the high order bits. Therefore, this implementation uses the straight radix sort [6] , or least signiticant digit [4] variation of the radix algorithm in order to postpone any load imbalances until the last pass through the data. A radix exchange sort, or most significant digit implementation of the radix algorithm would require only one all-to-all message exchange, followed by a local sort on each node, but the method could be more prone to performance degradation due to load imbalance. Related Work The problem o","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. 
Proceedings","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121223041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Z-Buffer on a Transputer-Based Machine
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633155
Jian-jin Li, S. Miguet
This paper describes the parallel implementation of the Z-Buffer algorithm on a distributed memory machine. The Z-Buffer is one of the most popular techniques used to generate a representation of a scene consisting of objects in a 3-dimensional world. We propose and compare two different parallel implementations on a network of Transputers. In the first approach, the description of the scene is distributed among the processors configured as a tree. The picture is processed in a pipelined fashion, in order to output parts of the image during the computation of the remainder. In the second approach, both the picture and the scene description are distributed to the processors, interconnected in a ring. We therefore have to redistribute the tiles dynamically among the processors at the beginning of the computation. We show that the two approaches are complementary: for small pictures or large scenes, a tree-based algorithm performs better than a ring-based algorithm, but for large pictures or small scenes, it is the other way round. We obtain substantial speedups over the sequential implementation, with up to 32 processors.
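The kernel that both parallel schemes distribute is the standard per-pixel depth test; a minimal sequential sketch (the image size, types, and function names are illustrative assumptions, not the paper's):

```c
/* Z-Buffer core: keep, per pixel, the colour of the nearest fragment. */
#include <float.h>

#define W 640
#define H 480

static float    zbuf[H][W];
static unsigned color[H][W];

void zbuffer_clear(void) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            zbuf[y][x]  = FLT_MAX;   /* "infinitely far" */
            color[y][x] = 0;
        }
}

/* Called for every rasterised fragment of every object in the scene. */
void zbuffer_plot(int x, int y, float z, unsigned c) {
    if (z < zbuf[y][x]) {            /* nearer than what is stored? */
        zbuf[y][x]  = z;
        color[y][x] = c;
    }
}
```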
{"title":"Z-Buffer on a Transputer-Based Machine","authors":"Jian-jin Li, S. Miguet","doi":"10.1109/DMCC.1991.633155","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633155","url":null,"abstract":"This paper describes the parallel implementation of the Z-Buffer algorithm on a distributed memory machine. The Z-Buffer is one of the most popular techniques used to generate a representation of a scene consisting of objects in a 3-dimensional world. We propose and compare two different parallel implementations on a network of Transputers. In the first approach, the description of the scene is distributed among the processors configured as a tree. The picture is processed in a pipelined fashion, in order to output parts of the image during the computation of the remainder. In a second approach, both the picture and the scene description are distributed to the processors. interconnected in a ring. We have therefore to redistribute dynamically the tiles among the processors at the beginning of the computation. We show thlat the two approaches are complementary : for small pictures or large scenes, a tree-based algorithm performs better than a ringbased algorithm, but for large pictures or small scenes, it is the other way round. We obtain substantial speedups over the sequential implementation, with up to 32 processors.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121436931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Many/370: A Parallel Computer Prototype For I/O Intensive Applications
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633364
B. Abali, B.D. Gavril, R. Hadsell, L. Lam, B. Shimamoto
This article is an overview of Many/370, an IBM System/370 parallel processor prototype built for I/O-intensive applications. The prototype consists of 8 processor nodes, 128 small disk drives, and a host computer. The nodes have a high-performance disk I/O capability which distinguishes Many/370 from other multiprocessors. The eight nodes and the host are interconnected by a non-blocking switch, and they communicate using extensions of the System/370 instruction set. Each node has a disk adapter attached to it. The disk adapter has 4 separate SCSI buses and it controls 16 disk drives. The disk adapter performs the functions of a System/370 channel and a control unit.
{"title":"Many/370: A Parallel Computer Prototype For I/0 Intensive Applications","authors":"B. Aball, B.D. Gavril, R. Hadsell, L. Lam, B. Shimamoto","doi":"10.1109/DMCC.1991.633364","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633364","url":null,"abstract":"This article is an overview of Many/370, an IBM System/370 parallel processor prototype built JOT I/Ointensive app1ication.s. The prototype consists of 8 processor nodes, 128 small disk drives, and a host c omputer. The nodes h ave a high performance disk I/O capability which distinguishes Many/37O from other multiprocessors. The eight nodes and the host are interconnected By a non-blocking switch, and they corn.tion set. Each node has a disk adaptcr attach,ed to it. The disk adapter has 4 separate SCSI buses and it controls 16 disk d rives. The disk adapter performs the functions a Systetn/370 c hannel and a control unit. municate using e xtensions ol the System/37O I ’ r1 st TU c","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121848803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Dataparallel C: A SIMD Programming Language for Multicomputers
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633095
P. Hatcher, M. J. Quinn, A. Lapadula, R. Anderson, R. R. Jones
Dataparallel C is a SIMD extension to the standard C programming language. It is derived from the original C* language developed by Thinking Machines Corporation. We have completed a third-generation Dataparallel C compiler, which produces SPMD-style C code suitable for execution on Intel and nCUBE multicomputers. In this paper we discuss the characteristics and strengths of data-parallel programming languages, summarize the syntax and semantics of Dataparallel C, and document the performance of six benchmark programs executing on the nCUBE 3200 multicomputer. Our work demonstrates that SIMD programs can achieve reasonable speedup when compiled and executed on multicomputers.
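The target of such a compiler is the SPMD pattern in which every node runs the same C program on its own slice of each parallel array. A hedged illustration of that output style (not the compiler's actual generated code; `mynode` and `nnodes` stand in for whatever the Intel or nCUBE runtime provides):

```c
/* SPMD rendering of a data-parallel statement: conceptually 'a[i] *= s'
 * for all i at once; each of the nnodes processors owns a block of a[]. */
#define N 1024

void scale(double a[N], double s, int mynode, int nnodes) {
    int lo = (int)((long)N * mynode / nnodes);        /* first owned index */
    int hi = (int)((long)N * (mynode + 1) / nnodes);  /* one past the last */
    for (int i = lo; i < hi; i++)
        a[i] *= s;    /* purely local; no communication for this statement */
}
```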
{"title":"Dataparallel C: A SIMD Programming Language for Multicomputers","authors":"P. Hatcher, M. J. Quinn, A. Lapadula, R. Anderson, R. R. Jones","doi":"10.1109/DMCC.1991.633095","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633095","url":null,"abstract":"Dataparallel C is a SIMD extensiotii to the standard C programming language, It is derived from the original C* language developed by Thinking Machine,r Corporation, We have completed a third-generation Dataparalle1 C compiler, which produces SPMD-style C code suitable for execution on Intel and nCUBE multicomputers. In this paper we discuss the characteristics and strengths of data-parallel programming languages, summarize the syntax and semantics of Dataparallel C', and document the perjbrmance of six benchmark programs executing on the nCUBE 3200 multicomputer. Our work demonstrates that SIMD programs can achieve reasonable speedup when compiled and executed on multicomputers.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127686174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Access based data decomposition for distributed memory machines
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633122
J. Ramanujam, P. Sadayappan
This paper addresses the problem of partitioning data for distributed memory machines or multicomputers. If insufficient attention is paid to the data allocation problem, then the amount of time spent in interprocessor communication might be so high as to seriously undermine the benefits of parallelism. It is therefore worthwhile for a compiler to analyze patterns of data usage to determine allocation, in order to minimize interprocessor communication. We present a matrix notation to describe array accesses in fully parallel loops which lets us derive sufficient conditions for communication-free decomposition of arrays.
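A worked instance in the standard affine-access notation (assumed here; the paper's exact symbols may differ): with iteration vector $\vec{\imath} = (i, j)^T$, the fully parallel statement A[i][j] = B[i][j-1] performs the accesses

\[
A\!\left[\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\vec{\imath}\,\right],
\qquad
B\!\left[\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\vec{\imath} + \begin{pmatrix}0\\ -1\end{pmatrix}\right].
\]

Partitioning both arrays by rows (blocks of $i$) is communication-free, since every access touches only row $i$ of either array; partitioning by columns would ship one element of $B$ per iteration between neighbouring processors.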
{"title":"Access based data decomposition fam distributed memory machines","authors":"J. Ramanujam, P. Sadayappan","doi":"10.1109/DMCC.1991.633122","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633122","url":null,"abstract":"This paper addresses the problem of partitioning data for distributed memory machines or multicomputers. If in-suucient attention is paid to the data allocation problem, then the amount of time spent in interprocessor communication might be so high as to seriously undermine the beneets of parallelism. It is therefore worthwhile for a compiler to analyze patterns of data usage to determine allocation, in order to minimize interprocessor communication. We present a matrix notation to describe array accesses in fully parallel loops which lets us derive suu-cient conditions for communication-free decomposition of arrays.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129314884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633211
Ching-Tien Ho, Lennart Johnsson, Alan Edelman
For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place, a processor must receive about P²/√N elements of each input operand, with operands of size P × P distributed evenly over N processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to P²/(√N log N) for each input operand. We present a two-level partitioning of the matrices and an algorithm for matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity 2P³ and parallel arithmetic complexity 2P³/N. The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine.
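To make the counts concrete, a worked example under the formulas above (the figures P = 1024, N = 64 are illustrative, not from the paper):

\[
\frac{P^2}{N} = 16384 \ \text{elements stored per processor}, \qquad
\frac{P^2}{\sqrt{N}} = 131072 \ \text{elements received per operand},
\]

and with all $\log_2 N = 6$ ports of each node driven concurrently, the sequential transfer count drops to about $P^2/(\sqrt{N}\,\log_2 N) \approx 21845$ elements per operand.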
{"title":"Matrix Multiplication on Hypercubes Using Full Bandwith and Constant Storage","authors":"Ching-Tien Ho, Lennart Johnsson, Alan Edelman","doi":"10.1109/DMCC.1991.633211","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633211","url":null,"abstract":"For matrix multiplicatioln on hypercube multiprocessors with the product matrix accumulated in place a processor must receive albout P2/n elements of each input operand, with opeicands of size P x P distributed evenly over N processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to P2/fllog1J for each input operand. We present a two-level partitioning of the matrices and an algolrithm for the matrix: multiplication with optimal data. motion and constant storage. The algorithm has sequential arithmetic complexity 2P3, and parallel arithmetic complexity 2P3/N. The algorithm has been implemented oin the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured iibout 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130131595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
When "Grain Size" Doesn't Matter 当“颗粒大小”不重要时
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633317
M. Carter, N. Nayar, J. Gustafson, D. Hoffman, D. Kouri, O. Sharafeddin
We describe insights gained from putting a quantum scattering problem on two very different parallel architectures: MasPar MP-1 (massively parallel) and nCUBE 2 (moderately parallel). Our nearly trivial port from the SIMD MasPar to the MIMD nCUBE demonstrates that it is not categorically difficult to move software from one parallel architecture class to another. These machines show widely different processor and problem grain sizes. Their performance is strikingly similar on small problems, a fact not predicted by machine grain size, problem grain size, or peak speed comparisons. We introduce a new metric, fixed-time efficiency, that correlates very well with our experiments and has predictive value. Data and control decomposition and communication considerations are analyzed for each machine.
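The abstract does not define fixed-time efficiency; for orientation only, Gustafson's fixed-time (scaled-speedup) model, which a metric of this name plausibly builds on, keeps the wall-clock budget constant and grows the problem with the machine. With serial fraction $s$ measured on the parallel system,

\[
S_{\text{fixed-time}}(N) = s + (1 - s)\,N,
\qquad
E_{\text{fixed-time}}(N) = \frac{S_{\text{fixed-time}}(N)}{N},
\]

though the paper's actual definition may differ from this assumed form.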
{"title":"When \"Grain Size\" Doesn't Matter","authors":"M. Carter, N. Nayar, J. Gustafson, D. Hoffman, D. Kouri, O. Sharafeddin","doi":"10.1109/DMCC.1991.633317","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633317","url":null,"abstract":"We describe insights gained from putting a quantum scattering problem on two very different parallel architectures: MasPar MP-I (massively parallel) and nCUBE 2 (moderately parallel). Our nearly trivial port from the SIMD MasPar to the MIMD nCUBE demonstrates that it is not categorically difficult to move software from one parallel architecture class to another. These machines show widely different processor and problem grain sizes. Their performance is strikingly similar on mal l problems, a fact not predicted by machine grain size, problem grain size, or peak speed comparisons. We introduce a new metric, fixed-time efficiency, that correlates very well with our experiments and has predictive value. Data and control decomposition and communication considerations are analyzed for each machine.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131155928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Domain Decomposition and Incomplete Factorisation Methods for Partial Differential Equations
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633166
C. Christara
In this paper we develop and study a method which tries to combine the merits of Domain Decomposition (DD) and the Incomplete Cholesky preconditioned Conjugate Gradient method (ICCG) for the parallel solution of linear elliptic Partial Differential Equations (PDEs) on rectangular domains. We first discretise the PDE problem using Spline Collocation, a method of Finite Element type based on smooth splines. This gives rise to a sparse linear system of equations. The ICCG method provides us with a very efficient, but not straightforwardly parallelisable, linear solver for such systems. On the other hand, DD methods are very effective for elliptic PDEs. A combination of DD and ICCG methods, in which the subdomain solves are carried out with ICCG, leads to efficient and highly parallelisable solvers. We implement this hybrid DD-ICCG method on a hypercube, discuss its parallel efficiency, and show results from experiments on configurations with up to 32 processors. We apply a totally local communication scheme and discuss its performance on the iPSC/2 hypercube. A similar approach can be used with other PDE discretisation methods.
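The subdomain solver at the core of the method is preconditioned conjugate gradients; a minimal self-contained sketch of that iteration follows, with a 1-D Laplacian matvec and a diagonal preconditioner standing in for the paper's incomplete Cholesky factor purely so the example runs:

```c
/* Preconditioned CG for A x = b. Here A = tridiag(-1, 2, -1) and
 * M = diag(A) are stand-ins; ICCG would instead take M ~= L L^T from an
 * incomplete Cholesky factorisation of A. */
#include <math.h>
#include <stdio.h>

#define N 64

static void matvec(const double *x, double *y) {       /* y = A x */
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * x[i] - (i > 0 ? x[i-1] : 0.0) - (i < N-1 ? x[i+1] : 0.0);
}

static void precond(const double *r, double *z) {      /* z = M^{-1} r */
    for (int i = 0; i < N; i++) z[i] = r[i] / 2.0;
}

static double dot(const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

int main(void) {
    double x[N] = {0}, b[N], r[N], z[N], p[N], q[N];
    for (int i = 0; i < N; i++) b[i] = 1.0;

    matvec(x, q);
    for (int i = 0; i < N; i++) r[i] = b[i] - q[i];     /* r0 = b - A x0 */
    precond(r, z);
    for (int i = 0; i < N; i++) p[i] = z[i];
    double rz = dot(r, z);

    int it = 0;
    while (sqrt(dot(r, r)) > 1e-10 && it++ < 1000) {
        matvec(p, q);
        double alpha = rz / dot(p, q);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        precond(r, z);
        double rz_new = dot(r, z);
        for (int i = 0; i < N; i++) p[i] = z[i] + (rz_new / rz) * p[i];
        rz = rz_new;
    }
    printf("converged in %d iterations\n", it);
    return 0;
}
```

In the hybrid method each subdomain runs such an iteration locally, so only the DD-level coupling requires interprocessor communication.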
{"title":"Domain Decomposition and Incomplete Factorisation Methods for Partial Differential Equations","authors":"C. Christara","doi":"10.1109/DMCC.1991.633166","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633166","url":null,"abstract":"In this paper we develop and study a method which tries to combine the merits of Domain Decompoxition (DD) and Incomplete Cholesky preconditioned Con,iugate Gradient method (ICCG) for the parallel solution of linear elliptic Partial Differential Equations (PDEs) on rectangular domains. We frst discretise the PDE problem, using Spline Collocation, a method of Finite Element type based on smooth splines. This gives rise to a sparse linear system of equations. The ICCG method provides us with a very effient, but not straightfarward parallelisable linear solver for such systems. On the (other hand, DD methods are very effective for elliptic PD.Es. A combination of DD and ICCG methods, in which the subdomain solves are carried out with ICCG, leads to eflcient and highly parallelisable solvers. We implement this hybrid DD-ICCG method on a hypercube, discuss its parallel eflciency, and show results from expieriments on configurations with up to 32 processors. We apply a totally local communication scheme and discuss its performance on the iPSCI2 hypercube. A similsrr approach can be used with other PDE discretisation methods.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125040106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Using Domain Decomposition to Solve Positive-Definite Systems on the Hypercube Computer
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633214
G.L. Hennigan, S. Castillo, E. Hensel
A distributed method of solving sparse, positive-definite systems of equations on a hypercube computer, like those arising from many finite-element problems, is studied. A domain decomposition method is introduced wherein the domain of the problem to be solved is physically split into several sub-domains. This physical split is based on an ordering known as one-way dissection [1]. The one-way dissection ordering generates a block-diagonal system of equations which is well suited to a parallel implementation. Once the ordering has been accomplished, each of the subdomains is then distributed to a processor in the hypercube computer as necessary. The method is applied to two-dimensional electrostatic problems which are governed by Laplace's equation. Since the finite-element method is used to discretize the problem, the method is developed to take full advantage of the inherent sparsity. The algorithm is applied to several geometries.
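One plausible picture of the structure such a dissection ordering produces (schematic; the paper's exact block pattern may differ): numbering the unknowns interior to each sub-domain first and the separator unknowns last gives a bordered block-diagonal matrix,

\[
A =
\begin{pmatrix}
D_1 &     &        &     & E_1 \\
    & D_2 &        &     & E_2 \\
    &     & \ddots &     & \vdots \\
    &     &        & D_k & E_k \\
E_1^T & E_2^T & \cdots & E_k^T & S
\end{pmatrix},
\]

where each $D_i$ couples only the unknowns of sub-domain $i$. The $k$ diagonal blocks can then be factored or solved independently, one per hypercube processor, with communication confined to the comparatively small separator system $S$.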
{"title":"Using Domain Decomposition to Solve Positive-Definite Systems on the Hypercube Computer","authors":"G.L. Hennigan, S. Castillo, E. Hensel","doi":"10.1109/DMCC.1991.633214","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633214","url":null,"abstract":"A distributed method of solving sparse, positive-definite systems of equations on a hypercube computer, like those arising fiom many finite-element problems, is studied. A domain decomposition method is introduced wherein the domain of the problem to be solved is physically split into several sub-domains. This physical split is based on an ordering known as one-way dissection [ I ] . The one-way dissection ordering generates a block-diagonal system of equations which is well suited to a parallel implementation. Once the ordering has been accomplished each of the subdomains is then distributed to a processor in the hypercube computer as necessary. The method is applied to two-dimensional electrostatic problems which are governed by Laplace’s equation. Since the finite-element method is used to discretize the problem the method is developed to take full advantage of the inherent sparsity. The algorithm is applied to several geometries.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115594556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0