Proceedings of the IEEE/ACM SC95 Conference: Latest Articles

Compiling and Optimizing for Decoupled Architectures

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.224301

N. Topham, A. Rawsthorne, Callum McLean, M. Mewissen, Peter L. Bird

Decoupled architectures provide a key to the problem of sustained supercomputer performance through their ability to hide large memory latencies. When a program executes in a decoupled mode, the perceived memory latency at the processor is zero; effectively the entire physical memory has an access time equivalent to the processor's register file, and latency is completely hidden. However, the asynchronous functional units within a decoupled architecture must occasionally synchronize, incurring a high penalty. The goal of compiling and optimizing for decoupled architectures is to partition the program between the asynchronous functional units in such a way that latencies are hidden but synchronization events are executed infrequently. This paper describes a model for decoupled compilation, and explains the effectiveness of compilation for decoupled systems. A number of new compiler optimizations are introduced and evaluated quantitatively using the Perfect Club scientific benchmarks. We show that with a suitable repertoire of optimizations, it is possible to hide large latencies most of the time for most of the programs in the Perfect Club.

Citations: 20

Message Passing Versus Distributed Shared Memory on Networks of Workstations

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.224285

Honghui Lu, S. Dwarkadas, A. Cox, W. Zwaenepoel

The message passing programs are executed with the Parallel Virtual Machine (PVM) library and the shared memory programs are executed using TreadMarks. The programs are Water and Barnes-Hut from the SPLASH benchmark suite; 3-D FFT, Integer Sort (IS) and Embarrassingly Parallel (EP) from the NAS benchmarks; ILINK, a widely used genetic linkage analysis program; and Successive Over-Relaxation (SOR), Traveling Salesman (TSP), and Quicksort (QSORT). Two different input data sets were used for Water (Water-288 and Water-1728), IS (IS-Small and IS-Large), and SOR (SOR-Zero and SOR-NonZero). Our execution environment is a set of eight HP735 workstations connected by a 100 Mbit/s FDDI network. For Water-1728, EP, ILINK, SOR-Zero, and SOR-NonZero, the performance of TreadMarks is within 10% of PVM. For IS-Small, Water-288, Barnes-Hut, 3-D FFT, TSP, and QSORT, differences are on the order of 10% to 30%. Finally, for IS-Large, PVM performs two times better than TreadMarks. More messages and more data are sent in TreadMarks, explaining the performance differences. This extra communication is caused by 1) the separation of synchronization and data transfer, 2) extra messages to request updates for data by the invalidate protocol used in TreadMarks, 3) false sharing, and 4) diff accumulation for migratory data in TreadMarks.

Citations: 112

A Performance Evaluation of the Convex SPP-1000 Scalable Shared Memory Parallel Computer

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.285573

T. Sterling, D. Savarese, P. MacNeice, K. Olson, C. Mobarry, B. Fryxell, P. Merkey

The Convex SPP-1000 is the first commercial implementation of a new generation of scalable shared memory parallel computers with full cache coherence. It employs a hierarchical structure of processing, communication, and memory name-space management resources to provide a scalable NUMA environment. Ensembles of 8 HP PA-RISC 7100 microprocessors employ an internal cross-bar switch and directory-based cache coherence scheme to provide a tightly coupled SMP. Up to 16 processing ensembles are interconnected by a 4 ring network incorporating a full hardware implementation of the SCI protocol for a full system configuration of 128 processors. This paper presents the findings of a set of empirical studies using both synthetic test codes and full applications for the Earth and space sciences to characterize the performance properties of this new architecture. It is shown that overhead and latencies of global primitive mechanisms, while low in absolute time, are significantly more costly than similar functions local to an individual processor ensemble.

Citations: 5

The Use of Cellular Automata in the Classroom

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.224204

H. A. Lilly

The paper explains what a cellular automaton is and why schools would want to integrate the study of cellular automata into their curricula. Examples are given and suggestions for sample exercises follow. Each example is given a title, a discipline to which it relates, a source from which the example or the motivation for the example was taken, and a recommended grade level--middle school or high school. Source code in Microsoft's FORTRAN PowerStation, Version 1.0 is available for all of the examples. Each of the programs shows a visualization of a particular cellular automaton over time. A cellular automaton is a modeling tool that can be used in the classroom either with pencil and paper or on computers. Cellular automata can be important in motivating students, reaching students with certain learning styles, helping students develop modeling skills, and in the development of curricula for teaching certain computer technologies.
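To illustrate the kind of program the abstract describes, the following is a minimal sketch of a one-dimensional, two-state cellular automaton visualized over time. It is not taken from the paper, whose examples are written in FORTRAN PowerStation and are not reproduced here. The choice of an elementary automaton with wrap-around edges, the rule number 90, and the names `step`, `run`, and `render` are all assumptions made for illustration.

```python
def step(cells, rule):
    """Apply one update of an elementary (1-D, two-state) cellular automaton.

    `rule` is a number from 0 to 255; bit k of the rule gives the new state
    for the neighborhood whose (left, center, right) bits spell k in binary.
    The row wraps around at the edges (an assumption of this sketch).
    """
    n = len(cells)
    new_cells = []
    for i in range(n):
        left = cells[(i - 1) % n]
        center = cells[i]
        right = cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right
        new_cells.append((rule >> index) & 1)
    return new_cells


def run(rule, width, generations):
    """Start from a single live cell in the middle and record every generation."""
    cells = [0] * width
    cells[width // 2] = 1
    history = [cells]
    for _ in range(generations):
        cells = step(cells, rule)
        history.append(cells)
    return history


def render(history):
    """Draw one text row per generation: '#' for a live cell, '.' for a dead one."""
    return "\n".join("".join("#" if c else "." for c in row) for row in history)


if __name__ == "__main__":
    # Rule 90 sets each cell to the XOR of its two neighbors.
    print(render(run(rule=90, width=31, generations=15)))
```

The same update can be carried out by hand on graph paper, which matches the abstract's point that the tool works with pencil and paper as well as on a computer.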

Citations: 7

Astrophysical N-Body Simulations on the GRAPE-4 Special-Purpose Computer

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.224400

J. Makino, M. Taiji

We report on recent astrophysical N-body simulations performed on the GRAPE-4 (GRAvity PipE 4) system, a special-purpose computer for astrophysical N-body simulations. We first review the astrophysical motivation, the algorithm, the structure of the GRAPE system, and the actual performance. The GRAPE-4 system consists of 1692 pipeline processors. The peak speed of one pipeline processor is 523 Mflops and that of the total system is 884 Gflops. The performance obtained is 529 Gflops for the simulation of two massive black holes in the core of a galaxy with 700,000 stars.
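The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes the system peak is simply the per-pipeline peak multiplied by the number of pipelines, and that "efficiency" means sustained speed divided by peak speed; the function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
PIPELINES = 1692
PEAK_PER_PIPELINE_MFLOPS = 523.0
REPORTED_PEAK_GFLOPS = 884.0
SUSTAINED_GFLOPS = 529.0


def aggregate_peak_gflops(pipelines, mflops_per_pipeline):
    """System peak, assuming it is the sum of the per-pipeline peaks."""
    return pipelines * mflops_per_pipeline / 1000.0


def efficiency(sustained_gflops, peak_gflops):
    """Fraction of the peak speed that the simulation actually sustained."""
    return sustained_gflops / peak_gflops


if __name__ == "__main__":
    peak = aggregate_peak_gflops(PIPELINES, PEAK_PER_PIPELINE_MFLOPS)
    print(f"computed peak: {peak:.3f} Gflops (abstract reports {REPORTED_PEAK_GFLOPS:.0f})")
    print(f"sustained / peak: {efficiency(SUSTAINED_GFLOPS, REPORTED_PEAK_GFLOPS):.1%}")
```

Under these assumptions the product comes to about 884.9 Gflops, consistent with the reported 884 Gflops, and the black hole simulation ran at roughly 60% of peak.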

Citations: 17

Multicast Virtual Topologies for Collective Communication in MPCs and ATM Clusters

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.224188

Y. Huang, Chengchang Huang, P. McKinley

This paper defines and describes the properties of a multicast virtual topology, the M-array and a resource-efficient variation, the REM-array. It is shown how several collective operations can be implemented efficiently using these virtual topologies, while maintaining low complexity. Because the methods are applicable to any parallel computing environment that supports multicast communication in hardware, they provide a framework for collective communication libraries that are portable and yet take advantage of such low-level hardware functionality. In particular, the paper describes the practical issues of using these methods in wormhole-routed massively parallel computers (MPCs) and in workstation clusters connected by Asynchronous Transfer Mode (ATM) networks. Performance results are given for both environments.

Citations: 12

Pittsburgh Supercomputing Center High School Initiative in Computational Science Report on Findings School Years: 1991-92, 1992-93, 1993-4

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.224200

C. Porto

The purpose of the Pittsburgh Supercomputing Center's High School Initiative was to motivate students to pursue careers in science, mathematics, engineering and computer science. The initiative generated excitement among teachers and their students by providing them with the opportunity to work on a project of their choosing using the world's fastest supercomputer — the same machine used by leading researchers working on today's most challenging scientific problems. The program gave teachers the means and support to institutionalize their computational science project into the curriculum so that the impact of the program would continue from year to year with each new class of students.

Citations: 1

I/O Limitations in Parallel Molecular Dynamics

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.224220

T. Clark, L. R. Scott, S. Wlodek, J. McCammon

We discuss data production rates and their impact on the performance of scientific applications using parallel computers. On one hand, excessively high rates of data production can be overwhelming, exceeding logistical capacities for transfer, storage and analysis. On the other hand, the rate limiting step in a computationally-based study should be the human-guided analysis, not the calculation. We present performance data for a biomolecular simulation of the enzyme acetylcholinesterase, which uses the parallel molecular dynamics program EulerGROMOS. The actual production rates are compared against a typical time frame for results analysis, where we show that the rate limiting step is the simulation, and that to overcome this will require improved output rates.

Citations: 3

A Multi-Level Algorithm For Partitioning Graphs

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.224228

B. Hendrickson, R. Leland

The graph partitioning problem is that of dividing the vertices of a graph into sets of specified sizes such that few edges cross between sets. This NP-complete problem arises in many important scientific and engineering problems. Prominent examples include the decomposition of data structures for parallel computation, the placement of circuit elements and the ordering of sparse matrix computations. We present a multilevel algorithm for graph partitioning in which the graph is approximated by a sequence of increasingly smaller graphs. The smallest graph is then partitioned using a spectral method, and this partition is propagated back through the hierarchy of graphs. A variant of the Kernighan-Lin algorithm is applied periodically to refine the partition. The entire algorithm can be implemented to execute in time proportional to the size of the original graph. Experiments indicate that, relative to other advanced methods, the multilevel algorithm produces high quality partitions at low cost.
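The coarsen-then-propagate structure described in the abstract can be sketched as follows. This is not the authors' implementation. The abstract does not say how the smaller graphs are built, so contracting a greedy maximal matching is an assumption here, and the spectral partitioning of the smallest graph and the Kernighan-Lin refinement are omitted. The names `edge_cut`, `coarsen`, and `project` are hypothetical.

```python
def edge_cut(edges, part):
    """Total weight of edges whose endpoints lie in different sets.

    `edges` maps (u, v) pairs to weights; `part[v]` is the set holding vertex v.
    """
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])


def coarsen(num_vertices, edges):
    """Build a smaller graph by contracting a greedy maximal matching.

    Returns (coarse vertex count, coarse edges, fine-to-coarse vertex map).
    Weights of parallel edges are summed so that cuts are preserved. A fuller
    version would also accumulate vertex weights so that the specified set
    sizes can still be enforced on the coarse graph.
    """
    coarse_of = [None] * num_vertices
    next_id = 0
    for (u, v) in sorted(edges):
        if u != v and coarse_of[u] is None and coarse_of[v] is None:
            coarse_of[u] = coarse_of[v] = next_id  # merge the matched pair
            next_id += 1
    for v in range(num_vertices):
        if coarse_of[v] is None:  # unmatched vertices carry over unchanged
            coarse_of[v] = next_id
            next_id += 1
    coarse_edges = {}
    for (u, v), w in edges.items():
        cu, cv = coarse_of[u], coarse_of[v]
        if cu != cv:  # edges inside a merged pair disappear
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return next_id, coarse_edges, coarse_of


def project(coarse_part, coarse_of):
    """Propagate a partition of the coarse graph back to the finer graph."""
    return [coarse_part[c] for c in coarse_of]


if __name__ == "__main__":
    # A 6-cycle with unit edge weights.
    cycle = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 4): 1, (4, 5): 1, (0, 5): 1}
    n_coarse, coarse_edges, coarse_of = coarsen(6, cycle)
    coarse_part = [0, 0, 1]  # stand-in for the spectral step on the small graph
    fine_part = project(coarse_part, coarse_of)
    print(n_coarse, fine_part, edge_cut(cycle, fine_part))
```

Because edge weights are summed during contraction, the cut measured on the coarse graph equals the cut of the projected partition on the finer graph, which is what makes it reasonable to partition the smallest graph first and refine on the way back up.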

Citations: 1300

A Parallel Software Infrastructure for Structured Adaptive Mesh Methods

Proceedings of the IEEE/ACM SC95 Conference

Pub Date : 1995-12-08 DOI: 10.1145/224170.224283

S. Kohn, S. Baden

Structured adaptive mesh algorithms dynamically allocate computational resources to accurately resolve interesting portions of a numerical calculation. Such methods are difficult to implement and parallelize because they rely on dynamic, irregular data structures. We have developed an efficient, portable, parallel software infrastructure for adaptive mesh methods; our software provides computational scientists with high-level facilities that hide low-level details of parallelism and resource management. We have applied our software infrastructure to the solution of adaptive eigenvalue problems arising in materials design. We describe our software infrastructure and analyze its performance. We also present computational results which indicate that the uniformity restrictions imposed by a data parallel Fortran implementation of a structured adaptive mesh application would significantly impact performance.

Citations: 31
