Tiling Imperfectly-nested Loop Nests
Nawaaz Ahmed, N. Mateev, K. Pingali
doi:10.1109/SC.2000.10018

Tiling is one of the more important transformations for enhancing locality of reference in programs. Intuitively, tiling a set of loops achieves the effect of interleaving iterations of these loops. Tiling of perfectly-nested loop nests (loop nests in which all assignment statements are contained in the innermost loop) is well understood. In practice, many loop nests are imperfectly-nested, so existing compilers use heuristics to try to find a sequence of transformations that convert such loop nests into perfectly-nested ones, but these heuristics do not always succeed. In this paper, we propose a novel approach to tiling imperfectly-nested loop nests. The key idea is to embed the iteration space of every statement in the imperfectly-nested loop nest into a special space called the product space, which is then tiled to produce the final code. We evaluate the effectiveness of this approach for dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks. No other single approach in the literature can tile all these codes automatically.

Parallel Smoothed Aggregation Multigrid: Aggregation Strategies on Massively Parallel Machines
R. Tuminaro, C. Tong
doi:10.1109/SC.2000.10008

Algebraic multigrid methods offer the hope that multigrid convergence can be achieved (for at least some important applications) without a great deal of effort from the engineers and scientists wishing to solve linear systems. In this paper we consider parallelization of smoothed aggregation multigrid methods. Smoothed aggregation is one of the most promising algebraic multigrid methods, so developing parallel variants with both good convergence and good efficiency properties is of great importance. However, parallelization is nontrivial due to the somewhat sequential aggregation (or grid coarsening) phase. In this paper, we discuss three different parallel aggregation algorithms and illustrate the advantages and disadvantages of each variant in terms of parallelism and convergence. Numerical results are shown on the Intel Teraflop computer for large problems coming from nontrivial codes: a quasi-static electric potential simulation and a fluid flow calculation.

High-Performance Reactive Fluid Flow Simulations Using Adaptive Mesh Refinement on Thousands of Processors
A. Calder, B. C. Curtis, L. Dursi, B. Fryxell, G. Henry, P. MacNeice, K. Olson, P. Ricker, R. Rosner, F. Timmes, H. Tufo, J. W. Truran, M. Zingale
doi:10.1109/SC.2000.10010

We present simulations and performance results of nuclear burning fronts in supernovae on the largest domain and at the finest spatial resolution studied to date. These simulations were performed on the Intel ASCI-Red machine at Sandia National Laboratories using FLASH, a code developed at the Center for Astrophysical Thermonuclear Flashes at the University of Chicago. FLASH is a modular, adaptive-mesh, parallel simulation code capable of handling compressible, reactive fluid flows in astrophysical environments. It is written primarily in Fortran 90, uses the Message-Passing Interface library for inter-processor communication and portability, and employs the PARAMESH package to manage a block-structured adaptive mesh that places blocks only where resolution is required and tracks rapidly changing flow features, such as detonation fronts, with ease. We describe the key algorithms and their implementation, as well as the optimizations required to achieve sustained performance of 238 GFLOPS on 6420 processors of ASCI-Red in 64-bit arithmetic.

A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters
S. Browne, J. Dongarra, N. Garner, K. London, P. Mucci
doi:10.1109/SC.2000.10029

The purpose of the PAPI project is to specify a standard API for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count "events": occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis and tuning. The PAPI project has proposed a standard set of hardware events and a standard cross-platform library interface to the underlying counter hardware. The PAPI library has been or is in the process of being implemented on all major HPC platforms. The PAPI project is developing end-user tools for dynamically selecting and displaying hardware counter performance data. PAPI support is also being incorporated into a number of third-party tools.

The MicroGrid: a Scientific Tool for Modeling Computational Grids
H. Song, Xianan Liu, D. Jakobsen, Ranjita Bhagwan, Xingbin Zhang, K. Taura, A. Chien
doi:10.1155/2000/481921

The complexity and dynamic nature of the Internet (and the emerging Computational Grid) demand that middleware and applications adapt to changes in the configuration and availability of resources. However, to the best of our knowledge there are no simulation tools that support systematic exploration of dynamic Grid software (or Grid resource) behavior. We describe our vision and initial efforts to build tools to meet these needs. Our MicroGrid simulation tools enable Globus applications to be run in arbitrary virtual grid resource environments, enabling broad experimentation. We describe the design of these tools and their validation on microbenchmarks, the NAS parallel benchmarks, and an entire Grid application. These validation experiments show that the MicroGrid can match actual experiments within a few percent (2% to 4%).

Hardware Prediction for Data Coherency of Scientific Codes on DSM
Jean-Thomas Acquaviva, W. Jalby
doi:10.1109/SC.2000.10037

This paper proposes a hardware mechanism for reducing the coherency overhead incurred by scientific computations on DSM systems. A first phase detects, in the address space, regular patterns (called streams) of coherency events (such as requests for exclusive or shared access, or invalidations). Once a stream is detected at a loop level, the regularity of data accesses can be exploited both within a loop (spatial locality) and between loops (temporal locality). We present a hardware mechanism capable of detecting and efficiently exploiting these regular patterns. Expected benefits as well as hardware complexity are discussed, and the limited drawbacks and potential overheads are exposed. For a benchmark suite of typical scientific applications, the results are very promising, both in terms of detected coherency streams and the effectiveness of our optimizations.

The Implementation of MPI-2 One-Sided Communication for the NEC SX-5
J. Träff, H. Ritzdorf, R. Hempel
doi:10.1109/SC.2000.10023

We describe the MPI/SX implementation of the MPI-2 standard for one-sided communication (Remote Memory Access) for the NEC SX-5 vector supercomputer. MPI/SX is a non-threaded implementation of the full MPI-2 standard. Essential features of the implementation are presented, including the synchronization mechanisms, the handling of communication windows in global shared and in process local memory, as well as the handling of MPI derived datatypes. In comparative benchmarks, the data transfer operations for one-sided communication and point-to-point message passing show very similar performance, both when data reside in global shared memory and when in process local memory. Derived datatypes, which are of particular importance for applications using one-sided communication, impose only a modest overhead and can be used without any significant loss of performance. Thus, the MPI/SX programmer can freely choose either the message passing or the one-sided communication model, whichever is most convenient for the given application.

Real-Time Biomechanical Simulation of Volumetric Brain Deformation for Image Guided Neurosurgery
S. Warfield, M. Ferrant, X. Gallez, A. Nabavi, F. Jolesz, R. Kikinis
doi:10.1109/SC.2000.10043

We aimed to study the performance of a parallel implementation of an intraoperative nonrigid registration algorithm that accurately simulates the biomechanical properties of the brain and its deformations during surgery. The algorithm was designed to allow for improved surgical navigation and quantitative monitoring of treatment progress, in order to improve the surgical outcome and to reduce the time required in the operating room. We have applied the algorithm to two neurosurgery cases with promising results. High performance computing is a key enabling technology that allows the biomechanical simulation to be executed quickly enough for the algorithm to be practical. Our parallel implementation was evaluated on a symmetric multiprocessor and two clusters and exhibited similar performance characteristics on each. The implementation was sufficiently fast to be used in the operating room during a neurosurgery procedure. It allowed a three-dimensional volumetric deformation to be simulated in less than ten seconds.

Towards an Integrated, Web-executable Parallel Programming Tool Environment
Insung Park, N. Kapadia, R. Figueiredo, R. Eigenmann, J. Fortes
doi:10.1109/SC.2000.10044

We present a new parallel programming tool environment that is (1) accessible and executable "anytime, anywhere" through standard Web browsers and (2) integrated, in that it provides tools that adhere to a common underlying methodology for parallel programming and performance tuning. The environment is based on a new network computing infrastructure developed at Purdue University. We evaluate our environment qualitatively by comparing our tool access method with conventional schemes of software download and installation. We also quantitatively evaluate the efficiency of interactive tool access in our environment, by measuring the response times of various functions of the URSA MINOR tool and comparing them with those of a Java Applet-based "anytime, anywhere" tool access method. We found that our environment offers significant advantages in terms of tool accessibility, integration, and efficiency.

Performance Modeling and Tuning of an Unstructured Mesh CFD Application
W. Gropp, D. Kaushik, D. Keyes, Barry F. Smith
doi:10.5555/370049.370405

This paper describes performance tuning experiences with a three-dimensional unstructured grid Euler flow code from NASA, which we have reimplemented in the PETSc framework and ported to several large-scale machines, including the ASCI Red and Blue Pacific machines, the SGI Origin, the Cray T3E, and Beowulf clusters. The code achieves a respectable level of performance for sparse problems, typical of scientific and engineering codes based on partial differential equations, and scales well up to thousands of processors. Since the gap between CPU speed and memory access rate is widening, the code is analyzed from a memory-centric perspective (in contrast to the traditional flop orientation) to understand its sequential and parallel performance. Performance tuning is approached on three fronts: data layouts to enhance locality of reference, algorithmic parameters, and the parallel programming model. This effort was guided partly by some simple performance models developed for the sparse matrix-vector product operation.