An Empirical Performance Evaluation of Scalable Scientific Applications
J. Vetter, A. Yoo
DOI: 10.1109/SC.2002.10036
We investigate the scalability, architectural requirements, and performance characteristics of eight scalable scientific applications. Our analysis is driven by empirical measurements using statistical and tracing instrumentation for both communication and computation. Based on these measurements, we refine our analysis into precise explanations of the factors that influence performance and scalability for each application; we distill these factors into common traits and overall recommendations for both users and designers of scalable platforms. Our experiments demonstrate that some traits, such as improvements in the scaling and performance of MPI's collective operations, will benefit most applications. We also find specific characteristics of some applications that limit performance. For example, one application's intensive use of a 64-bit floating-point divide instruction, which has high latency and is not pipelined on the POWER3, limits the performance of the application's primary computation.
Salinas: A Scalable Software for High-Performance Structural and Solid Mechanics Simulations
M. Bhardwaj, K. Pierson, G. Reese, T. Walsh, D. Day, K. Alvin, J. Peery, C. Farhat, M. Lesoinne
DOI: 10.1109/SC.2002.10028
We present Salinas, a scalable implicit software application for the finite element static and dynamic analysis of complex real-world structural systems. This relatively complete engineering code, with more than 100,000 lines of C++ and a long list of users, sustains 292.5 Gflop/s on 2,940 ASCI Red processors and 1.16 Tflop/s on 3,375 ASCI White processors.
Giggle: A Framework for Constructing Scalable Replica Location Services
A. Chervenak, E. Deelman, Ian T. Foster, Leanne P. Guy, Wolfgang Hoschek, Adriana Iamnitchi, C. Kesselman, P. Kunszt, M. Ripeanu, Robert Schwartzkopf, H. Stockinger, Kurt Stockinger, B. Tierney
DOI: 10.1109/SC.2002.10024
In wide area computing systems, it is often desirable to create remote read-only copies (replicas) of files. Replication can be used to reduce access latency, improve data locality, and/or increase robustness, scalability, and performance for distributed applications. We define a replica location service (RLS) as a system that maintains and provides access to information about the physical locations of copies. An RLS typically functions as one component of a data grid architecture. This paper makes the following contributions. First, we characterize RLS requirements. Next, we describe a parameterized architectural framework, which we name Giggle (for GIGa-scale Global Location Engine), within which a wide range of RLSs can be defined. We define several concrete instantiations of this framework with different performance characteristics. Finally, we present initial performance results for an RLS prototype, demonstrating that RLS systems can be constructed that meet performance goals.
Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs Using Memory Mapped Network Interfaces
Maria Athanasaki, A. Sotiropoulos, G. Tsoukalas, N. Koziris
DOI: 10.1109/SC.2002.10008
This paper describes the performance benefits attained by using enhanced network interfaces to achieve low-latency communication. We present a novel, pipelined scheduling approach that takes advantage of the DMA communication mode to send data to other nodes while the CPUs are performing calculations. We also use zero-copy communication through pinned-down physical memory regions provided by the NIC's driver modules. Our testbed concerns the parallel execution of tiled nested loops on a cluster of SMP nodes with a single PCI-SCI NIC in each node. In order to schedule tiles, we apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. Experimental evaluation illustrates that memory mapped NICs with enhanced communication features enable the use of a more advanced pipelined (overlapping) schedule, which considerably improves performance compared to an ordinary blocking schedule implemented with conventional, CPU- and kernel-bound communication primitives.
Efficient Synchronization for Nonuniform Communication Architectures
Z. Radovic, Erik Hagersten
DOI: 10.1109/SC.2002.10038
Scalable parallel computers are often nonuniform communication architectures (NUCAs), where the access time to other processors' caches varies with their physical location. Still, few attempts have been made to explore cache-to-cache communication locality. This paper introduces a new kind of synchronization primitive (lock-unlock) that favors neighboring processors when a lock is released. This improves the lock handover time as well as the access time to the shared data of the critical region. A critical section guarded by our new RH lock takes less than half the time to execute compared with the same critical section guarded by any other lock on our NUCA hardware. The execution time for Raytrace with 28 processors improved 2.23-4.68 times, while global traffic decreased dramatically compared with all the other locks. Averaged over the seven applications studied, execution time improved 7-24% while global traffic decreased 8-28%.
An Overview of the BlueGene/L Supercomputer
N. Adiga, G. Almási, G. Almási, Y. Aridor, R. Barik, D. Beece, Ralph Bellofatto, G. Bhanot, R. Bickford, M. Blumrich, A. Bright, J. Brunheroto, Calin Cascaval, J. Castaños, W. Chan, L. Ceze, P. Coteus, S. Chatterjee, Dong Chen, G. Chiu, T. Cipolla, P. Crumley, K. Desai, A. Deutsch, T. Domany, M. B. Dombrowa, W. Donath, M. Eleftheriou, C. Erway, J. Esch, B. Fitch, J. Gagliano, A. Gara, R. Garg, R. Germain, M. Giampapa, B. Gopalsamy, John A. Gunnels, Manish Gupta, F. Gustavson, S. Hall, R. Haring, D. Heidel, P. Heidelberger, L. Herger, D. Hoenicke, R. Jackson, T. Jamal-Eddine, G. Kopcsay, E. Krevat, M. Kurhekar, A. P. Lanzetta, D. Lieber, L. K. Liu, M. Lu, M. Mendell, A. Misra, Y. Moatti, L. Mok, J. Moreira, B. J. Nathanson, M. Newton, M. Ohmacht, A. Oliner, Vinayaka Pandit, R. Pudota, R. Rand, R. Regan, B. Rubin, A. Ruehli, S. Rus, R. Sahoo, A. Sanomiya, E. Schenfeld, M. Sharma, Edi Shmueli, Sarabjeet Singh, Peilin Song, V. Srinivasan, B. Steinmacher-Burow, K. Strauss, C. Surovic, R. Swetz, T. Takken, R. T
DOI: 10.5555/762761.762787
This paper gives an overview of the BlueGene/L Supercomputer, a jointly funded research partnership between IBM and the Lawrence Livermore National Laboratory as part of the United States Department of Energy ASCI Advanced Architecture Research Program. Application performance and scaling studies have recently been initiated with partners at a number of academic and government institutions, including the San Diego Supercomputer Center and the California Institute of Technology. This massively parallel system of 65,536 nodes is based on a new architecture that exploits system-on-a-chip technology to deliver a target peak processing power of 360 teraFLOPS (trillion floating-point operations per second). The machine is scheduled to be operational in the 2004-2005 time frame, at price/performance and power consumption/performance targets unobtainable with conventional architectures.
A TCP Tuning Daemon
T. Dunigan, M. Mathis, B. Tierney
DOI: 10.1109/SC.2002.10023
Many high performance distributed applications require high network throughput but are able to achieve only a small fraction of the available bandwidth. A common cause of this problem is improperly tuned network settings. Tuning techniques, such as setting the correct TCP buffers and using parallel streams, are well known in the networking community, but outside the networking community they are infrequently applied. In this paper, we describe a tuning daemon that uses TCP instrumentation data from the Unix kernel to transparently tune TCP parameters for specified individual flows over designated paths. No modifications are required to the application, and the user does not need to understand network or TCP characteristics.
Improving Route Lookup Performance Using Network Processor Cache
Kartik Gopalan, T. Chiueh
DOI: 10.1109/SC.2002.10006
Earlier research has shown that the route lookup performance of a network processor can be significantly improved by caching ranges of lookup/classification keys rather than individual keys. While the previous work focused specifically on reducing capacity misses, we address two other important aspects: (a) reducing conflict misses and (b) cache consistency during frequent route updates. We propose two techniques to minimize conflict misses that aim to balance the number of cacheable entries mapped to each cache set. They offer different tradeoffs between performance and simplicity while improving the average route lookup time by 76% and 45.2%, respectively. To maintain cache consistency during frequent route updates, we propose a selective cache invalidation technique that can limit the degradation in lookup latency to within 10.2%. Our results indicate a potentially large improvement in lookup performance for network processors used at the Internet edge and motivate further research into caching at the Internet core.
Accelerating Parallel Maximum Likelihood-Based Phylogenetic Tree Calculations Using Subtree Equality Vectors
A. Stamatakis, T. Ludwig, H. Meier, Marty J. Wolf
DOI: 10.1109/SC.2002.10016
Heuristics for calculating phylogenetic trees for large sets of aligned rRNA sequences based on the maximum likelihood method are computationally expensive. The core of most parallel algorithms, which accounts for the greatest part of computation time, is the tree evaluation function, which calculates the likelihood value for each tree topology. This paper describes and uses Subtree Equality Vectors (SEVs) to reduce the number of floating-point operations required during topology evaluation. We integrated our optimizations into various sequential programs and into parallel fastDNAml, one of the most common and efficient parallel programs for calculating large phylogenetic trees. Experimental results for our parallel program, which renders exactly the same output as parallel fastDNAml, show global runtime improvements of 26% to 65%. The optimization scales best on clusters of PCs, which also implies a substantial cost-saving factor for the determination of large trees.
A New Data-Mapping Scheme for Latency-Tolerant Distributed Sparse Triangular Solution
K. Teranishi, P. Raghavan, E. Ng
DOI: 10.1109/SC.2002.10020
This paper concerns latency-tolerant schemes for the efficient parallel solution of sparse triangular linear systems on distributed memory multiprocessors. Such triangular solution is required when sparse Cholesky factors are used to solve for a sequence of right-hand-side vectors, or when incomplete sparse Cholesky factors are used to precondition a Conjugate Gradients iterative solver. In such applications, the use of traditional distributed substitution schemes can create a performance bottleneck when the latency of interprocessor communication is large. We had earlier developed the Selective Inversion (SI) scheme to reduce communication latency costs by replacing distributed substitution with parallel matrix-vector multiplication. We now present a new two-way mapping of the triangular sparse matrix to processors that improves the performance of SI by halving its communication latency costs. We provide analytic results for model sparse matrices and report on the performance of our scheme for parallel preconditioning with incomplete sparse Cholesky factors.