The Sixth Distributed Memory Computing Conference, 1991. Proceedings: Latest Papers

Performance and Assembly Language Programming of the iPSC/860 System
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633312
D. Scott, G. Withers
In the world of supercomputers, the goal is higher and higher performance. To obtain the highest performance on a particular computational kernel, it is usually necessary to write assembly language. Compiler technology has not matured to the point of being able to take advantage of many of the features of the i860 automatically. To approach the peak performance of the chip, it is currently necessary to use custom assembly language code. It is important to know which combinations of assembly instructions offer the highest performance, and which combinations cannot run at full speed. This paper assumes that you are already acquainted with the basics of the i860 microprocessor assembly language, and concentrates on describing how to enhance the performance of your code using i860 assembly language.
Citations: 9
Characterizing the Balance of Parallel I/O Systems
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633363
J. French
High performance I/O subsystems are a key element in parallel computer systems that intend to compete with traditional supercomputers in the solution of large scientific and engineering problems. The hardware and software organization and performance of these I/O subsystems are fundamental issues in parallel computer systems that have not been explored in depth. Two commercially available parallel file systems are the Intel iPSC/2 Concurrent File System (CFS) and the NCUBE/ten NChannel board and disk farm. Both systems are aimed at support of high volume, large block I/O of the type typical of large scientific computations. The evaluation of these systems has proved difficult. There are many parameters affecting performance and the system dynamics are quite complex. In this paper we examine a method of quantifying the balance of an I/O system, that is, how well it services I/O requests with respect to fairness and distribution of overheads. One may gauge the degree of balance in a system by asking: When resources become saturated, is the bottleneck felt equally by each process or are some processes given preferential service? This paper explores a simple yardstick of system balance.

1. Quantifying and Measuring Parallel I/O

Suppose that we have $p$ processes reading (writing) a file of $N$ bytes in parallel. Each process $i$ reads (writes) $n$ bytes in time $t_i$, where $n = N/p$. The individual data transfer rate of a particular processor $i$ is given by $r_i = n/t_i$. The average individual data transfer rate is given by $\bar{r} = \frac{1}{p}\sum_{i=1}^{p} r_i$. There are at least two reasonable measures of the aggregate data transfer rate of the $p$ processors. In the first case, we sum the data rates of the individual processors. This gives rise to the quantity $\sum_{i=1}^{p} r_i$, called the maximum sustained aggregate rate (max-SAR). We call this measure the "maximum" rate because, by construction, it assumes that each processor $i$ contributes a rate $r_i$ and all processors contribute during the same time instant, however brief. From the definition of $\bar{r}$ above, we see that

$$\text{max-SAR} = \sum_{i=1}^{p} r_i = p\,\bar{r}. \qquad (1)$$

This interpretation is illustrated in Figure 1(a). In the second case, we consider that all $N$ bytes move through the system in $\tau = \max_i t_i$ time units. That is, the entire file is not transferred until the slowest processor finishes reading (writing) its partition of the file. This gives rise to the quantity $N/\tau$, called the minimum sustained aggregate rate (min-SAR). We call this a "minimum" rate since this is the rate that an outside observer will perceive as the rate at which the entire processor ensemble is operating. From the definitions above, we see that ...

This research was supported in part by JPL Contract #957721 and by the Department of Energy under Grant DE-FG05-88ER25063.
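The two aggregate measures described in this abstract can be illustrated with a small Python sketch. It assumes, following the text, that each of the p processes moves N/p bytes, that max-SAR is the sum of the individual rates, and that min-SAR is the file size divided by the time of the slowest process. The function name and the timings are hypothetical and are not taken from the paper.

```python
def sustained_aggregate_rates(file_bytes, times):
    """Return (max_sar, min_sar) for p processes that each move file_bytes / p bytes.

    times[i] is the elapsed time t_i of process i, in seconds.
    """
    p = len(times)
    n = file_bytes / p                  # bytes moved by each process
    rates = [n / t for t in times]      # individual rates r_i = n / t_i
    max_sar = sum(rates)                # sum of the r_i, equal to p times the mean rate
    min_sar = file_bytes / max(times)   # whole file over the slowest process's time
    return max_sar, min_sar


# Hypothetical timings: four processes share a 4,000,000-byte file; one is slower.
max_sar, min_sar = sustained_aggregate_rates(4_000_000, [1.0, 1.0, 1.0, 2.0])
print(max_sar, min_sar)
```

When every process finishes at the same time the two values coincide, so the gap between them is one rough indication of how unevenly the processes were served.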
Citations: 4
Routing between Subcubes in a Hypercube
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633149
S. Padmanabhan
As hypercube sizes and node capabilities increase, applications such as database processing and task management which utilize parallelism within a task and between tasks are becoming important. These applications require a new routing paradigm, where data (or programs) residing in a subcube are transferred to another subcube. In this paper, we describe and analyze an algorithm for routing data from the nodes of a k-dimension subcube to the nodes of any other k-dimension subcube in the hypercube. We show that the algorithm enables data transfer between the two subcubes to be performed optimally in the current generation of direct-connect hypercubes. Also, the algorithm is very simple and can be executed in $\Theta(n)$ steps, where n is the dimension of the hypercube.
Citations: 3
Efficient Communication Primitives on Circuit-Switched Hypercubes
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633172
Ching-Tien Ho, M. Raghunath
We give practical algorithms, complexity analysis and implementation for all-to-all personalized communication and matrix transpose (with two-dimensional partitioning of the matrix) on hypercubes. We assume the following communication characteristics: circuit-switched e-cube routing and a one-port communication model. For all-to-all personalized communication, we propose a hybrid algorithm that combines the well-known recursive doubling algorithm [22,12] and a direct-route algorithm [26,23]. Our hybrid algorithm balances the data transfer time and start-up time of these two algorithms, and its communication complexity is estimated to be better than that of the two previous algorithms for a range of machine parameters. For matrix transpose with two-dimensional partitioning of the matrix, our algorithm is measured to be better than the recursive transpose algorithm [8] by n nearest-neighbor communications [12]. Our algorithm takes advantage of circuit-switched routing and is congestion-free for a hypercube with e-cube routing. We also suggest a way of storing the matrix such that the transpose operation can take advantage of the routing of the machine.
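As a rough illustration of the routing model assumed in this abstract, the Python sketch below computes an e-cube path between two hypercube nodes by correcting the differing address bits in a fixed order. The lowest-bit-first order is an assumption (machines differ), and the XOR pairing in `direct_exchange_partner` is one commonly used pairwise schedule shown only for illustration; neither function is taken from the paper.

```python
def ecube_path(src, dst):
    """List the nodes visited from src to dst, fixing differing bits lowest-first."""
    path = [src]
    node = src
    diff = src ^ dst          # bits in which the two addresses differ
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit  # cross the link in dimension `bit`
            path.append(node)
        diff >>= 1
        bit += 1
    return path


def direct_exchange_partner(node, step):
    """Partner of `node` in exchange step `step` (1 .. 2**n - 1) of an XOR schedule."""
    return node ^ step


print(ecube_path(0b000, 0b101))  # [0, 1, 5]
```

In the XOR schedule every step pairs the nodes off without overlap, so each node has exactly one partner per step, which fits a one-port model.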
Citations: 9
Mapping Precedence-Constrained Simulation Tasks for a Parallel Environment
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633067
J. Sartor, G. Lamont, R. Hammell, T. Hartrum
Classical results on the deterministic precedence-constrained scheduling problem are almost exclusively concerned with a single iteration of the task system. This paper explores the problem of mapping deterministic tasks to processors in a parallel simulation environment, with each task iterating multiple times. Counterexamples are shown to demonstrate that multiple passes through an optimal mapping for one iteration of a task system may produce less-than-optimal results when compared to mappings based on the iterative nature of the simulation. A level strategy for assigning iterative tasks to processors is developed, and theoretical and experimental results are discussed for different mapping strategies in a VHDL simulation. This paper examines the classical multiprocessor scheduling problem for application to deterministic simulation systems. The tasks in these systems are characterized by iterative executions: each task executes more than once in the course of a simulation run. The general task scheduling problem and its relationship to the mapping problem for simulation tasks are introduced. The problem space is constrained, limiting the scope of the study to systems which map equal-execution-time tasks onto identical processors. A theoretical basis for the level strategy of iterative task assignment is summarized, and a polynomial-time algorithm based on this strategy is given. The results of hypercube experiments based on different mapping strategies are discussed with application to VHDL logic simulation.
Citations: 0
ALDIMS - A Language for Programming Distributed Memory Multiprocessors
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633132
K. G. Kumar, D. Kulkarni, A. Basu, A. Paulraj
In this paper we present ALDIMS, a language that combines the expressibility of general functional (MIMD) parallelism with compact expressibility of data (SPMD) parallelism. It uses distributed data structures for specifying data partitions and single assignment variables as abstract means of inter-process communication. Constructs for unstructured parallelism and process placement specifications make general MIMD parallelism expressible. We describe the issues of implementing process invocation and communication primitive generation. We also discuss source level parallelization and optimization issues and strategies.
Citations: 0
Static Program Assignment in Circuit Switched Multiprocessors
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633136
J. Lindberg
Exploiting the performance of distributed memory multiprocessors necessitates efficient algorithms for assigning concurrently executable program tasks to processors: the well-known mapping problem. Traditionally, solutions to the mapping problem have been based on a model of inter-processor communication where communication cost increases linearly with distance, and it is this cost that is the principal determinant of performance. Therefore, most existing algorithms attempt to find assignments that minimize the graph theoretic distance between communicating processors. In circuit switched multiprocessors, it is typically circuit blocking and not the inter-processor communication latency that dominates. We propose the use of an adaptive variant of simulated annealing to search for an acceptable assignment. This algorithm is useful for determining assignments for multiprocessor architectures implementing both the circuit switched and store-and-forward models of inter-processor communication.
Citations: 1
A Parallel Approach To Solving a 3-D Finite Element Problem on a Distributed Memory MIMD Machine
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633157
A. Amin, A. Chaudhary, P. Sadayappan
A three-dimensional nonlinear rigid-viscoplastic metal forming finite element package, ALPID-3D, is being developed to run on distributed-memory MIMD parallel computers. Efficient parallelization of the application requires identification and efficient mapping of the compute-intensive part of the finite element code on the parallel machine. This primarily includes the generation and solution of the finite element matrix governing equations within each nonlinear iteration. The Element By Element Preconditioned Conjugate Gradient (EBE-PCG) method is used for solving the finite element matrix equations. An approach to minimizing the communication overhead during the EBE-PCG iterations is presented, along with timing results.
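For readers unfamiliar with the solver family named in this abstract, here is a minimal Python sketch of a preconditioned conjugate gradient iteration. It is not the paper's EBE-PCG: as simplifying assumptions, it uses an assembled dense matrix and a plain diagonal (Jacobi) preconditioner, whereas an element-by-element method forms the matrix-vector product from per-element contributions. The function name and the test system are hypothetical.

```python
def pcg_jacobi(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite A given as a list of rows."""
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = b[:]                                  # residual b - A x for the start x = 0
    z = [r[i] / A[i][i] for i in range(n)]    # apply the diagonal preconditioner
    d = z[:]                                  # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:            # stop once the residual norm is small
            break
        Ad = matvec(d)
        alpha = rz / dot(d, Ad)               # step length along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        d = [zi + (rz_new / rz) * di for zi, di in zip(z, d)]
        rz = rz_new
    return x


# Hypothetical 2x2 system whose exact solution is x = (1/11, 7/11).
x = pcg_jacobi([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(x)
```

On a distributed-memory machine the dot products and the matrix-vector product are the steps that need communication, which is presumably why the abstract singles out communication overhead inside the iterations.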
Citations: 2
Multidimensional Spreadsheets in a Graphical Symbolic Debugger for the Ncube
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633125
A. Couch, D.W. Krumme
We consider the problem of data presentation in a command-oriented debugger for a distributed system. We present an extension of the command syntax found in most serial and several parallel debuggers, whereby a spreadsheet of textual information may be constructed from data of varying types obtained from distributed locations. This spreadsheet is presented to the user in a window which is scrollable in four independent dimensions under keypad control. This extension is implemented in the Seeplane debugger for the Ncube/2.
Citations: 1
Processor-Time Tradeoffs for Cayley Graph Interconnection Networks
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633348
Marc Baumslag, A. Rosenberg
We show that every processor array whose interconnection network is based on a Cayley graph of non[…] (so that the graph's underlying group has a nontrivial-size subgroup) can be emulated in a work-preserving manner, on general computations, by a (smaller) quotient array. If the underlying group has nontrivial subgroups of several orders, one can thus choose among several matchups of time and hardware requirements. Our emulations gain efficiency when additional structural uniformity is present.
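The quotient idea can be illustrated on a small case. The sketch below is not taken from the paper: it assumes the hypercube as the guest network (the Cayley graph of (Z_2)^n under XOR, with the single-bit masks as generators), takes the subgroup H of elements whose low bits are all zero, and uses an idealized cost model in which each host processor simulates its |H| guest processors one after another with no communication overhead. All function names are hypothetical.

```python
def cayley_edges(n_bits):
    """Directed edges of the n-cube viewed as a Cayley graph of (Z_2)^n.

    Vertices are the integers 0 .. 2^n - 1, the group operation is XOR,
    and the generators are the n single-bit masks.
    """
    gens = [1 << i for i in range(n_bits)]
    return {(v, v ^ g) for v in range(1 << n_bits) for g in gens}


def coset_of(v, low_bits):
    """Label of the coset of v modulo H, where H is the subgroup of
    elements whose low `low_bits` bits are zero (illustrative choice)."""
    return v & ((1 << low_bits) - 1)


def quotient_edges(n_bits, low_bits):
    """Image of every guest edge in the quotient (coset) graph.

    Generators that lie inside H collapse to self-loops.
    """
    return {
        (coset_of(u, low_bits), coset_of(v, low_bits))
        for u, v in cayley_edges(n_bits)
    }


def emulation_work(n_bits, low_bits, guest_steps):
    """Return (guest_work, host_work) as processors * time.

    Assumed cost model: each host processor serially simulates the
    |H| guest processors in its coset, so time grows by a factor |H|.
    """
    guest_procs = 1 << n_bits
    host_procs = 1 << low_bits
    load = guest_procs // host_procs  # |H| guests per host processor
    host_steps = guest_steps * load
    return guest_procs * guest_steps, host_procs * host_steps


if __name__ == "__main__":
    q = quotient_edges(4, 2)
    non_loops = {e for e in q if e[0] != e[1]}
    print(non_loops == cayley_edges(2))  # quotient of the 4-cube is a 2-cube
    print(emulation_work(4, 2, 10))      # processors * time on guest and host
```

Under these assumptions, choosing a different subgroup order changes only `low_bits`: fewer host processors means proportionally more time, while the processor-time product stays fixed.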
Citations: 8