
Proceedings of the IEEE/ACM SC95 Conference: Latest Publications

Parallel Implementations of the Power System Transient Stability Problem on Clusters of Workstations
Pub Date : 1995-12-08 DOI: 10.1145/224170.224279
M. T. Bruggencate, S. Chalasani
Power system transient stability analysis computes the response of the rapidly changing electrical components of a power system to a sequence of large disturbances followed by operations to protect the system against the disturbances. Transient stability analysis involves repeatedly solving large, very sparse, time varying non-linear systems over thousands of time steps. In this paper, we present parallel implementations of the transient stability problem in which we use direct methods to solve the linearized systems. One method uses factorization and forward and backward substitution to solve the linear systems. Another method, known as the W-Matrix method, uses factorization and partitioning to increase the amount of parallelism during the solution phase. The third method, the Repeated Substitution method, uses factorization and computations which can be done ahead of time to further increase the amount of parallelism during the solution phase. We discuss the performance of the different methods implemented on a loosely coupled, heterogeneous network of workstations (NOW) and the SP2 cluster of workstations.
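The factor-once, solve-many pattern behind the first method can be sketched as follows (a minimal dense sketch with made-up numbers; the paper's systems are large, sparse, and solved in parallel, and the pivoting-free Doolittle factorization below assumes a diagonally dominant matrix):

```python
def lu_factor(A):
    """Doolittle LU factorization without pivoting (assumes a
    diagonally dominant matrix, as is typical for power-network
    admittance matrices)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0
        for j in range(i, n):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

def lu_solve(L, U, b):
    n = len(b)
    # forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    # backward substitution: U x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
L, U = lu_factor(A)            # factor once ...
for b in ([6.0, 12.0, 14.0], [1.0, 2.0, 3.0]):
    x = lu_solve(L, U, b)      # ... then solve cheaply, once per time step
```

The W-Matrix and Repeated Substitution methods differ in how this solution phase is restructured to expose parallelism, not in the factorization itself.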
Citations: 16
Surveying Molecular Interactions with DOT
Pub Date : 1995-12-08 DOI: 10.1145/224170.224218
L. T. Eyck, J. Mandell, V. Roberts, M. Pique
The purpose of the molecular interaction program DOT (Daughter of Turnip) is rapid computation of the electrostatic potential energy between two proteins or other charged molecules. DOT exhaustively tests all six degrees of freedom, rotational and translational, and produces a grid of approximate interaction energies and orientations. It is able to do this because the problem is cast as the convolution of the potential field of the first molecule and any rotated charge distribution of the second. The algorithm lends itself to both parallelization and vectorization, permitting huge increases in computational speed over other methods for obtaining the same information. For example, a complete mapping of interactions between plastocyanin and cytochrome c was done in eight minutes using 256 nodes of an Intel Paragon. DOT is expected to be particularly useful as a rapid screen to find configurations for more detailed study using exact energy models.
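The convolution trick that makes DOT fast can be illustrated in miniature: one FFT pair scores every translation of a charge grid against a potential grid at once. The sketch below uses small random 2-D grids as stand-ins for DOT's 3-D molecular fields:

```python
import numpy as np

# Hypothetical tiny grids standing in for DOT's 3-D fields:
# `potential` plays the first molecule's precomputed potential grid,
# `charges` the second molecule's charge distribution in one orientation.
rng = np.random.default_rng(0)
potential = rng.normal(size=(16, 16))
charges = rng.normal(size=(16, 16))

# Cross-correlation via FFT: energy[dy, dx] is the interaction score
# for the charge grid shifted by (dy, dx), computed for all shifts at once.
energy = np.real(np.fft.ifft2(np.fft.fft2(potential) * np.conj(np.fft.fft2(charges))))

# Direct evaluation of a single (cyclic) translation for comparison.
dy, dx = 3, 5
direct = np.sum(potential * np.roll(charges, (dy, dx), axis=(0, 1)))
```

In DOT this is repeated for each sampled rotation of the second molecule, which is what makes the search over all six degrees of freedom tractable.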
Citations: 61
HPC Undergraduate Curriculum Development at SDSU Using SDSC Resources
Pub Date : 1995-12-08 DOI: 10.1145/224170.224209
Kris Stewart
Results from the development and teaching of a senior-level undergraduate multidisciplinary course in high performance computing are presented. The course has been taught four times, and several "lessons learned" are presented in this paper. Help from the technical staff at the San Diego Supercomputer Center and support from the National Science Foundation have been instrumental in the evolution of this course. The work of faculty at other universities has influenced the author's courses and is gratefully acknowledged. A subsequent sophomore-level course was developed at SDSU and has become part of a voluntary, cooperative program, Undergraduate Computational Science and Engineering.
Citations: 3
A Novel Approach Towards Automatic Data Distribution
Pub Date : 1995-12-08 DOI: 10.1145/224170.224500
Jordi Garcia, E. Ayguadé, Jesús Labarta
Data distribution is one of the key aspects that a parallelizing compiler for a distributed memory architecture should consider, in order to get efficiency from the system. The cost of accessing local and remote data can be one or several orders of magnitude different, and this can dramatically affect performance. In this paper, we present a novel approach to automatically perform static data distribution. All the constraints related to parallelism and data movement are contained in a single data structure, the Communication-Parallelism Graph (CPG). The problem is solved using a linear 0-1 integer programming model and solver. In this paper we present the solution for one-dimensional array distributions, although its extension to multi-dimensional array distributions is also outlined. The solution is static in the sense that the layout of the arrays does not change during the execution of the program. We also show the feasibility of using this approach to solve the problem in terms of compilation time and quality of the solutions generated.
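A toy version of the optimization can make the formulation concrete. Here each array gets a 0/1 variable choosing between two one-dimensional distributions, with hypothetical per-array costs and pairwise communication penalties; exhaustive enumeration stands in for the paper's 0-1 integer-programming solver:

```python
from itertools import product

# Toy stand-in for the CPG formulation (all numbers hypothetical):
# three arrays, each choosing BLOCK (0) or CYCLIC (1) distribution.
# Edge weights are the communication cost paid when two arrays used
# in the same loop pick different distributions.
edges = {(0, 1): 5.0, (1, 2): 3.0, (0, 2): 4.0}
# Per-array cost of each choice (e.g. lost parallelism).
node_cost = [(0.0, 2.0), (1.0, 0.0), (0.0, 1.0)]

def total_cost(x):
    """Objective of the 0-1 program for one assignment vector x."""
    c = sum(node_cost[i][xi] for i, xi in enumerate(x))
    c += sum(w for (i, j), w in edges.items() if x[i] != x[j])
    return c

# Brute force over the 2^3 assignments; a real compiler would hand
# this objective to a 0-1 integer linear programming solver.
best = min(product((0, 1), repeat=3), key=total_cost)
```

With these numbers the minimum-cost layout gives all three arrays the same distribution, reflecting how the communication terms pull conflicting choices together.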
Citations: 58
Parallelizing the Phylogeny Problem
Pub Date : 1995-12-08 DOI: 10.1145/224170.224224
Je Jones, K. Yelick
The problem of determining the evolutionary history of species in the form of phylogenetic trees is known as the phylogeny problem. We present a parallelization of the character compatibility method for solving the phylogeny problem. Abstractly, the algorithm searches through all subsets of characters, which may be traits like opposable thumbs or DNA sequence values, looking for a maximal consistent subset. The notion of consistency in this case is the existence of a particular kind of phylogenetic tree called a perfect phylogeny tree. The two challenges to achieving an efficient implementation are load balancing and efficient sharing of information to enable pruning. In both cases, there is a trade-off between communication overhead and the quality of the solution. For load balancing we use a distributed task queue, which has imperfect load information but avoids centralization bottlenecks. For sharing pruning information, we use a distributed trie, which also avoids centralization but maintains incomplete information. We evaluate several implementations of the trie, the best of which achieves speedups of 50 on a 64-processor CM-5.
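The subset search can be sketched sequentially, with the pairwise four-gamete compatibility test for binary characters standing in for the full perfect-phylogeny check (toy data; the paper's contribution is parallelizing this search with a distributed task queue and a distributed pruning trie):

```python
from itertools import combinations

# Binary character matrix: rows are species, columns are characters.
matrix = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]

def compatible(ci, cj):
    """Four-gamete test: two binary characters can sit on one tree iff
    at most three of the four state pairs occur across the species."""
    pairs = {(row[ci], row[cj]) for row in matrix}
    return len(pairs) < 4

def max_consistent_subset():
    """Largest-first search, so the first pairwise-compatible subset
    found is maximal."""
    n = len(matrix[0])
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(compatible(i, j) for i, j in combinations(subset, 2)):
                return subset
    return ()
```

In the parallel version, subsets become tasks in the distributed queue, and subsets already known to fail are shared through the trie so other processors can prune them without re-testing.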
Citations: 11
Microparallelism and High-Performance Protein Matching
Pub Date : 1995-12-08 DOI: 10.1145/224170.224222
B. Alpern, L. Carter, K. Gatlin
The Smith-Waterman algorithm is a computationally-intensive string-matching operation that is fundamental to the analysis of proteins and genes. In this paper, we explore the use of some standard and novel techniques for improving its performance. We begin by tuning the algorithm using conventional techniques. These make modest performance improvements by providing efficient cache usage and inner-loop code. One novel technique uses the z-buffer operations of the Intel i860 architecture to perform 4 independent computations in parallel. This achieves a five-fold speedup over the optimized code (six-fold over the original). We also describe a related technique that could be used by processors that have 64-bit integer operations, but no z-buffer. Another new technique uses floating-point multiplies and adds in place of the standard algorithm's integer additions and maximum operations. This gains more than a three-fold speedup on the IBM POWER2 processor. This method doesn't give the identical answers as the original program, but experimental evidence shows that the inaccuracies are small and do not affect which strings are chosen as good matches by the algorithm.
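The recurrence being tuned is the standard Smith-Waterman score matrix. A plain integer reference version, the baseline that the paper's z-buffer and floating-point variants accelerate, looks like this (scoring parameters are illustrative, not the paper's):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local-alignment score: each cell takes the
    maximum of zero, a diagonal match/mismatch step, and two gap
    steps; the answer is the largest cell anywhere in the matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align / substitute
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best
```

The microparallel tricks in the paper exploit the fact that each inner-loop step is a short chain of additions and maximums, which can be packed four to a z-buffer operation or recast as floating-point multiply-adds.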
Citations: 52
High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet
Pub Date : 1995-12-08 DOI: 10.1109/SUPERC.1995.32
S. Pakin, Mario Lauria, A. Chien
In most computer systems, software overhead dominates the cost of messaging, reducing delivered performance, especially for short messages. Efficient software messaging layers are needed to deliver the hardware performance to the application level and to support tightly-coupled workstation clusters. Illinois Fast Messages (FM) 1.0 is a high speed messaging layer that delivers low latency and high bandwidth for short messages. For 128-byte packets, FM achieves bandwidths of 16.2MB/s and one-way latencies of 32 µs on Myrinet-connected SPARCstations (user-level to user-level). For shorter packets, we have measured one-way latencies of 25 µs, and for larger packets, bandwidth as high as 19.6MB/s, delivered bandwidth greater than OC-3. FM is also superior to the Myrinet API messaging layer, not just in terms of latency and usable bandwidth, but also in terms of the message half-power point (n_{1/2}), which is two orders of magnitude smaller (54 vs. 4,409 bytes). We describe the FM messaging primitives and the critical design issues in building a low-latency messaging layer for workstation clusters. Several issues are critical: the division of labor between host and network coprocessor, management of the input/output (I/O) bus, and buffer management. To achieve high performance, messaging layers should assign as much functionality as possible to the host. If the network interface has DMA capability, the I/O bus should be used asymmetrically, with the host processor moving data to the network and exploiting DMA to move data to the host. Finally, buffer management should be extremely simple in the network coprocessor and match queue structures between the network coprocessor and host memory. Detailed measurements show how each of these features contributes to high performance.
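The half-power point n_{1/2} follows from the usual linear cost model for a message of n bytes, T(n) = t0 + n/B: delivered bandwidth n/T(n) reaches B/2 exactly at n = t0·B. A sketch with illustrative numbers (not the paper's measurements, which come from a more detailed fit):

```python
# Textbook latency/bandwidth model of message cost: T(n) = t0 + n/B.
# Setting delivered bandwidth n/T(n) equal to B/2 and solving gives
# the half-power point n_1/2 = t0 * B.
t0 = 10e-6            # software + wire latency, seconds (illustrative)
B = 20e6              # asymptotic bandwidth, bytes/second (illustrative)

n_half = t0 * B       # message size that delivers half of B, in bytes

def delivered_bw(n):
    """Bandwidth actually seen by the application for n-byte messages."""
    return n / (t0 + n / B)
```

This is why the paper treats n_{1/2} as a figure of merit alongside raw latency and bandwidth: a small n_{1/2} means short messages already see a large fraction of the peak.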
Citations: 476
Gigabit I/O for Distributed-Memory Machines: Architecture and Applications
Pub Date : 1995-12-08 DOI: 10.1145/224170.224375
Michael Hemy, P. Steenkiste
Distributed-memory systems have traditionally had great difficulty performing network I/O at rates proportional to their computational power. The problem is that the network interface has to support network I/O for a supercomputer, using computational and memory bandwidth resources similar to those of a workstation. As a result, the network interface becomes a bottleneck. We implemented an architecture for network I/O for the iWarp system with the following two key characteristics: first, application-specific tasks are off-loaded from the network interface to the distributed-memory system, and second, these tasks are performed in close cooperation with the application. The network interface has been used by several applications for over a year. In this paper we describe the network interface software that manages the communication between the iWarp distributed-memory system and the network interface, we validate the main features of our network interface architecture based on application experience, and we discuss how this architecture can be used by other distributed-memory systems.
Citations: 3
The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation
Pub Date : 1995-12-08 DOI: 10.1145/224170.224397
Andrew Erlichson, B. A. Nayfeh, J. Singh, K. Olukotun
Clustering processors together at a level of the memory hierarchy in shared address space multiprocessors appears to be an attractive technique from several standpoints: resources are shared, packaging technologies are exploited, and processors within a cluster can share data more effectively. We investigate the performance benefits that can be obtained by clustering on a range of important scientific and engineering applications in moderate to large scale cache coherent machines with small degrees of clustering (up to one eighth of the total number of processors in a cluster). We find that except for applications with near-neighbor communication topologies, this degree of clustering is not very effective in reducing the inherent communication-to-computation ratios. Clustering is more useful in reducing the number of remote capacity misses in unstructured applications, and can improve performance substantially when small first-level caches are clustered in these cases. This suggests that clustering at the first-level cache might be useful in highly-integrated, relatively fine-grained environments. For less integrated machines such as current distributed shared memory multiprocessors, our results suggest that clustering at the first-level caches is not very useful in improving application performance; however, our results also suggest that in a machine with long interprocessor communication latencies, clustering further away from the processor can provide performance benefits.
引用次数: 31
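The near-neighbor finding above can be illustrated with a toy model (our illustration, not from the paper): when processors that communicate on a 2D grid are grouped into square clusters, only the links crossing a cluster boundary remain remote, so the remote fraction falls as 1/sqrt(cluster size). The function names and latency figures below are hypothetical.

```python
import math

def remote_link_fraction(cluster_size: int) -> float:
    """Toy model: processors on a 2D grid, each exchanging data with its
    4 nearest neighbors, are grouped into square clusters. For a k x k
    cluster block (k = sqrt(cluster_size)), 4k of the block's 4k^2
    neighbor links cross the cluster boundary, so the remote fraction
    is 1/k. With cluster_size = 1 (no clustering) every link is remote."""
    k = math.isqrt(cluster_size)
    assert k * k == cluster_size, "toy model assumes square clusters"
    return 1.0 / k

def comm_time_per_step(cluster_size: int,
                       local_latency_us: float,
                       remote_latency_us: float) -> float:
    """Average per-processor communication cost per time step, charging
    cheap intra-cluster latency for links kept inside a cluster."""
    f_remote = remote_link_fraction(cluster_size)
    links = 4  # nearest-neighbor exchanges per step
    return links * (f_remote * remote_latency_us
                    + (1.0 - f_remote) * local_latency_us)

if __name__ == "__main__":
    # Larger clusters keep more near-neighbor traffic local.
    for c in (1, 4, 16, 64):
        print(c, remote_link_fraction(c), comm_time_per_step(c, 1.0, 20.0))
```

Under this model, clustering helps near-neighbor codes most when the local/remote latency gap is large, which matches the paper's observation that clustering further from the processor pays off mainly on machines with long interprocessor communication latencies.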
SCIRun: A Scientific Programming Environment for Computational Steering
Pub Date : 1995-12-08 DOI: 10.1145/224170.224354
S. Parker, C. R. Johnson
We present the design, implementation and application of SCIRun, a scientific programming environment that allows the interactive construction, debugging and steering of large scale scientific computations. Using this "computational workbench," a scientist can design and modify simulations interactively via a dataflow programming model. SCIRun enables scientists to design and modify models and automatically change parameters and boundary conditions as well as the mesh discretization level needed for an accurate numerical solution. As opposed to the typical "off-line" simulation mode - in which the scientist manually sets input parameters, computes results, visualizes the results via a separate visualization package, then starts again at the beginning - SCIRun "closes the loop" and allows interactive steering of the design and computation phases of the simulation. To make the dataflow programming paradigm applicable to large scientific problems, we have identified ways to avoid the excessive memory use inherent in standard dataflow implementations, and have implemented fine-grained dataflow in order to further promote computational efficiency. In this paper, we describe applications of the SCIRun system to several problems in computational medicine. In addition, we have included an interactive demo program in the form of an application of the SCIRun system to a small electrostatic field problem.
Citations: 359
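SCIRun's own sources are not reproduced on this page; as a rough sketch of the dataflow-steering idea the abstract describes (hypothetical class and method names, not the SCIRun API), a module graph can cache each stage's result and, when a parameter is steered, re-dirty and recompute only that stage's downstream cone:

```python
from typing import Callable, List, Sequence

class Module:
    """A dataflow node: caches its output and recomputes it lazily from
    upstream outputs. A steered parameter change re-dirties only the
    node's downstream cone, so unaffected stages are not re-run."""
    def __init__(self, name: str, fn: Callable,
                 inputs: Sequence["Module"] = ()):
        self.name, self.fn = name, fn
        self.inputs = list(inputs)
        self.downstream: List["Module"] = []
        self.dirty, self.value = True, None
        for m in self.inputs:
            m.downstream.append(self)  # register for invalidation

    def invalidate(self) -> None:
        # Computational steering: mark this node and everything below it stale.
        self.dirty = True
        for m in self.downstream:
            m.invalidate()

    def result(self):
        if self.dirty:  # fire only when an input or parameter has changed
            self.value = self.fn(*(m.result() for m in self.inputs))
            self.dirty = False
        return self.value
```

For example, in a hypothetical mesh → solve → render pipeline, changing a boundary condition and calling `solve.invalidate()` re-runs only the solver and renderer while the cached mesh is reused, which is the "closing the loop" behavior the abstract contrasts with off-line simulation.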