Architecture and performance of the Hitachi SR2201 massively parallel processor system
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580901
Hiroaki Fujii, Y. Yasuda, Hideya Akashi, Y. Inagami, Makoto Koga, Osamu Ishihara, M. Kashiyama, Hideo Wada, Tsutomu Sumimoto
RISC-based Massively Parallel Processors (MPPs) often show low efficiency in real-world applications because of cache-miss penalties, insufficient memory-system throughput, and poor inter-processor communication performance. Hitachi's SR2201, an MPP scalable up to 2048 processors and a peak performance of 600 GFLOPS, overcomes these problems by introducing three novel features. First, its processor, the 150 MHz HARP-1E, overcomes the cache-miss penalty through "pseudo vector processing" (PVP), in which data is prefetched into a special register bank, bypassing the cache. Second, a multi-bank memory architecture that operates like a pipeline eliminates the memory-system bottleneck. Third, inter-processor communication achieves high performance over a three-dimensional crossbar network, using a "remote DMA transfer" protocol and hardware-based cache coherency. As a result of these improvements, the SR2201 achieved 220.4 GFLOPS with 1024 processors on the LINPACK benchmark, almost 72% of peak performance.
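A rough sketch of the PVP idea, shown on a DAXPY-style loop: operands for a later iteration are requested while the current iteration computes, so the floating-point pipeline does not stall on cache misses. The prefetch intrinsic and the preload distance below are stand-in assumptions; the HARP-1E's preload fills a dedicated register bank rather than the cache.

```cpp
#include <cstddef>

// DAXPY-style loop illustrating pseudo vector processing: data for
// iteration i + kDist is requested while iteration i computes, hiding
// memory latency. __builtin_prefetch (GCC/Clang) only warms the cache;
// the real PVP preload targets a special register bank instead.
void daxpy_pvp_sketch(std::size_t n, double alpha,
                      const double* x, double* y) {
    const std::size_t kDist = 16;  // assumed preload distance
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kDist < n) {
            __builtin_prefetch(&x[i + kDist]);
            __builtin_prefetch(&y[i + kDist]);
        }
        y[i] += alpha * x[i];  // operands were requested kDist iterations ago
    }
}
```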
{"title":"Architecture and performance of the Hitachi SR2201 massively parallel processor system","authors":"Hiroaki Fujii, Y. Yasuda, Hideya Akashi, Y. Inagami, Makoto Koga, Osamu Ishihara, M. Kashiyama, Hideo Wada, Tsutomu Sumimoto","doi":"10.1109/IPPS.1997.580901","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580901","url":null,"abstract":"RISC-based Massively Parallel Processors (MPPs) often show low efficiency in real-world applications because of cache miss penalty, insufficient throughput of the memory system, and poor inter-processor communication performance. Hitachi's SR2201, an MPP scalable up to 2048 processors and 600 GFLOPS peak performance, overcomes these problems by introducing three novel features. First, its processor the 150 MHz HARP-IE, solves the cache miss penalty by \"pseudo vector processing\" (PVP). In PVP, data is loaded by prefetching to a special register bank, bypassing the cache. Second, a multi-bank memory architecture that operates like a pipeline eliminates the memory system bottleneck. Third, the inter-processor communication achieves high performance on the three-dimensional crossbar network, using a \"remote DMA transfer\" protocol and a hardware-based cache coherency. As the result of these improvements, the SR2201 achieved 220.4 GFLOPS with 1024 processors in the LINPACK benchmark, which is almost 72% of the peak performance.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124767918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and implementation of virtual memory-mapped communication on Myrinet
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580931
C. Dubnicki, A. Bilas, Kai Li
Describes the design and implementation of the Virtual Memory-Mapped Communication (VMMC) model on a Myrinet network of PCI-based PCs. VMMC was originally designed and implemented for the SHRIMP multicomputer, where it delivers user-to-user latency and bandwidth close to the limits imposed by the underlying hardware. The goals of this work are: to provide an implementation of VMMC on a commercially available hardware platform; to determine whether the benefits of VMMC can be realized on the new hardware; and to investigate network-interface design tradeoffs by comparing SHRIMP with Myrinet and their respective VMMC implementations. Our Myrinet implementation of VMMC achieves 9.8 μs one-way latency and provides 108.4 MByte/s user-to-user bandwidth. Compared to SHRIMP, the Myrinet implementation of VMMC incurs relatively higher overhead and demands more network-interface resources (LANai processor, on-board SRAM), but requires less operating-system support.
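The VMMC usage pattern, sketched as a toy single-process model (the function names are hypothetical, not the paper's API): a receiver exports a region of its virtual address space, a sender imports it, and each send deposits data directly into the receiver's memory, with no receiver-side copy or per-message system call. The registry below stands in for the import/export tables kept on the network interface.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>
#include <unordered_map>
#include <vector>

// Toy stand-in for VMMC-style communication (hypothetical names).
// export_buffer publishes a receive region; import_buffer obtains a
// handle to it; vmmc_send writes bytes straight into the receiver's
// memory, which is the essence of the model.
struct Region { std::uint8_t* base; std::size_t len; };

std::unordered_map<int, Region> g_exports;  // stands in for the NIC's table

void export_buffer(int handle, void* base, std::size_t len) {
    g_exports[handle] = Region{static_cast<std::uint8_t*>(base), len};
}

Region import_buffer(int handle) { return g_exports.at(handle); }

void vmmc_send(const Region& dst, const void* src, std::size_t len) {
    if (len <= dst.len) std::memcpy(dst.base, src, len);  // direct deposit
}

int main() {
    std::vector<char> recv_buf(64);
    export_buffer(/*handle=*/1, recv_buf.data(), recv_buf.size());

    Region dst = import_buffer(1);
    const char msg[] = "hello, vmmc";
    vmmc_send(dst, msg, sizeof msg);

    std::cout << recv_buf.data() << '\n';  // prints: hello, vmmc
}
```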
{"title":"Design and implementation of virtual memory-mapped communication on Myrinet","authors":"C. Dubnicki, A. Bilas, Kai Li","doi":"10.1109/IPPS.1997.580931","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580931","url":null,"abstract":"Describes the design and implementation of the Virtual Memory-Mapped Communication (VMMC) model on a Myrinet network of PCI-based PCs. VMMC has been designed and implemented for the SHRIMP multicomputer, where it delivers user-to-user latency and bandwidth close to the limits imposed by the underlying hardware. The goal of this work is: to provide an implementation of VMMC on a commercially available hardware platform; to determine whether the benefits of VMMC can be realized on the new hardware; and to investigate network interface design tradeoffs by comparing SHRIMP with Myrinet and its respective VMMC implementation. Our Myrinet implementation of VMMC achieves 9.8 /spl mu/s one-way latency and provides 108.4 MByte/s user-to-user bandwidth. Compared to SHRIMP, the Myrinet implementation of VMMC incurs relatively higher overhead and demands more network interface resources (LANai processor, on-board SRAM) but requires less operating system support.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125179798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing software DSM for compiler-parallelized applications
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580945
P. Keleher, C. Tseng
Current parallelizing compilers for message-passing machines support only a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed shared-memory (DSM) systems. We demonstrate such a system by combining the SUIF parallelizing compiler and the CVM software DSM. Innovations of the system include compiler-directed techniques that: (1) combine synchronization with the communication of parallelism information at parallel task invocation, (2) employ customized routines for evaluating reduction operations, and (3) select a hybrid update protocol that pre-sends data by flushing updates at barriers. For applications with sufficient granularity of parallelism, these optimizations yield very good eight-processor speedups on an IBM SP-2 and a DEC Alpha cluster, usually matching or exceeding the speedups of equivalent HPF and message-passing versions of each program. Flushing updates, in particular, eliminates almost all nonlocal memory misses and improves performance by 13% on average.
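A minimal sketch of the flush-at-barrier idea, under assumed data structures (this is not CVM's code): before waiting at a barrier, each node pushes its modified shared pages to the nodes recorded as their readers, so post-barrier reads hit locally instead of faulting and fetching remotely.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Toy model of a hybrid update protocol (assumed structures). Each dirty
// page is pre-sent to its recorded readers at the barrier, so the data
// arrives before it is demanded after the barrier.
struct Page {
    int id = 0;
    bool dirty = false;
    std::set<int> readers;            // nodes expected to read this page
    std::vector<std::uint8_t> bytes;  // current contents
};

struct Network {
    // deliveries[node][page] = bytes pushed to that node at the barrier
    std::map<int, std::map<int, std::vector<std::uint8_t>>> deliveries;
    void send_update(int node, const Page& p) { deliveries[node][p.id] = p.bytes; }
    void barrier_wait() { /* ordinary barrier synchronization */ }
};

void barrier_with_flush(std::vector<Page>& pages, Network& net) {
    for (Page& p : pages) {
        if (!p.dirty) continue;
        for (int reader : p.readers)   // flush updates to known readers
            net.send_update(reader, p);
        p.dirty = false;
    }
    net.barrier_wait();  // post-barrier reads now hit locally
}
```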
{"title":"Enhancing software DSM for compiler-parallelized applications","authors":"P. Keleher, C. Tseng","doi":"10.1109/IPPS.1997.580945","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580945","url":null,"abstract":"Current parallelizing compilers for message-passing machines only support a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed shared-memory (DSM) systems. We demonstrate such a system by combining the SUIF parallelizing compiler and the CVM software DSM. Innovations of the system include compiler-directed techniques that: (1) combine synchronization and parallelism information communication on parallel task invocation, (2) employ customized routines for evaluating reduction operations, and (3) select a hybrid update protocol that pre-sends data by flushing updates at barriers. For applications with sufficient granularity of parallelism, these optimizations yield very good eight processor speedups on an IBM SP-2 and DEC Alpha cluster usually matching or exceeding the speedup of equivalent HPF and message-passing versions of each program. Flushing updates, in particular, eliminates almost all nonlocal memory misses and improves performance by 13% on average.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125447251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joining forces in solving large-scale quadratic assignment problems in parallel
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580936
Adrian Brüngger, Ambros Marzetta, J. Clausen, Michael Perregaard
Program libraries are one way to make cooperation between specialists from various fields successful: the separation of application-specific knowledge from application-independent tasks ensures portability, maintainability, extensibility, and flexibility. This paper demonstrates the success of combining problem-specific knowledge for the quadratic assignment problem (QAP) with the raw computing power offered by contemporary parallel hardware, using ZRAM, a library of parallel search algorithms. Solutions of 10 previously unsolved large standard test instances of the QAP are presented.
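For reference, the QAP in its standard Koopmans-Beckmann form asks for a permutation (an assignment of facilities to locations) minimizing total flow-weighted distance, with flow matrix A = (a_ij) and distance matrix B = (b_kl):

```latex
\min_{\pi \in S_n} \; \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, b_{\pi(i)\pi(j)}
```

Parallel branch-and-bound of the kind ZRAM provides searches the space of partial assignments, pruning with lower bounds on this objective.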
{"title":"Joining forces in solving large-scale quadratic assignment problems in parallel","authors":"Adrian Brüngger, Ambros Marzetta, J. Clausen, Michael Perregaard","doi":"10.1109/IPPS.1997.580936","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580936","url":null,"abstract":"Program libraries are one way to make the cooperation between specialists from various fields successful: the separation of application-specific knowledge from application independent tasks ensures portability, maintenance, extensibility, and flexibility. This paper demonstrates the success in combining problem-specific knowledge for the quadratic assignment problem (QAP) with the raw computing power offered by contemporary parallel hardware by using the library of parallel search algorithms ZRAM. The solutions of 10 previously unsolved large standard test-instances of the QAP are presented.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130379657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel solutions of indexed recurrence equations
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580935
Gadi Haber, Y. Ben-Asher
A new type of recurrence equation, called an "indexed recurrence" (IR) equation, is defined, in which the common notion of X[i] = op(X[i], X[i-1]), i = 1...n, is generalized to X[g(i)] = op(X[f(i)], X[h(i)]), where f, g, h : {1...n} → {1...m}. This enables us to model sequential loops of the form "for i := 1 to n do X[g(i)] := op(X[f(i)], X[h(i)]);" as IR equations. Thus, a parallel algorithm that solves a set of IR equations is in fact a way to transform sequential loops into parallel ones. Note that the circuit evaluation problem (CVP) can also be expressed as a set of IR equations, so an efficient parallel solution to the general IR problem is unlikely to be found: such a solution would also solve the CVP, showing that P ⊆ NC. In this paper we introduce parallel algorithms for two variants of the IR problem: an O(log n) greedy algorithm using O(n) processors for the case where the g(i) are distinct and h(i) = g(i), and an O(log² n) algorithm using up to O(n²) processors with no restriction on f, g, or h. We also show that for general IR equations, op must be commutative for such a parallel computation to be possible.
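As a concrete sketch of the loop form, with the index maps passed in as arrays: choosing g(i) = f(i) = i and h(i) = i-1 with op = + recovers an ordinary prefix sum, while arbitrary g, f, h give the general problem discussed above.

```cpp
#include <functional>
#include <vector>

// Sequential loop of the IR form X[g(i)] := op(X[f(i)], X[h(i)]).
// Example instance (prefix sum over X[0..n-1], 0-based):
//   g[i] = f[i] = i + 1, h[i] = i for i = 0..n-2, op = std::plus<int>()
void ir_loop(std::vector<int>& X,
             const std::vector<int>& g, const std::vector<int>& f,
             const std::vector<int>& h,
             const std::function<int(int, int)>& op) {
    for (std::size_t i = 0; i < g.size(); ++i)
        X[g[i]] = op(X[f[i]], X[h[i]]);
}
```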
{"title":"Parallel solutions of indexed recurrence equations","authors":"Gadi Haber, Y. Ben-Asher","doi":"10.1109/IPPS.1997.580935","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580935","url":null,"abstract":"A new type of recurrence equations called \"indexed recurrences\" (IR) is defined in which the common notion of X[i]=op(X[i],X[i-1]) i=1...n is generalized to X[g(i)]=op(X[f(i)],X[h(i)]) f,g,h:{1...n}/spl rarr/{1...m}. This enables us to model sequential loops of the form for i=1 to n do begin X[g(i)]:=op(X[f(i)],X[h(i)];) as IR equations. Thus, a parallel algorithm that solves a set of IR equations is in fact a way to transform sequential loops into parallel ones. Note that the circuit evaluation problem (CVP) can also be expressed as a set of IR equations. Therefore an efficient parallel solution to the general IR problem is not likely to be found. As such solution would also solve the CVP, showing that P/spl sube/NC. In this paper we introduce parallel algorithms for two variants of the IR equations problem: An O(log n) greedy algorithm for solving IR equations where g(i) is distinct and h(i)=g(i) using O(n) processors. An O(log/sup 2/ n) algorithm with no restriction on f, g or h, using up to O(n/sup 2/) processors. However we show that for general IR, op must be commutative so that a parallel computation can be used.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129684663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Latency tolerance: a metric for performance analysis of multithreaded architectures
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580899
S. Nemawarkar, G. Gao
Multithreaded multiprocessor systems (MMS) have been proposed to tolerate long communication latencies. This paper provides an analytical framework, based on closed queueing networks, to quantify and analyze the latency tolerance of multithreaded systems. We introduce a new metric, called the tolerance index, which quantifies how close the performance of a system comes to that of an ideal system. We characterize latency tolerance as the architectural and program workload parameters change, and show how an analysis of latency tolerance provides insight into performance optimizations for fine-grain parallel program workloads.
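One illustrative way to formalize such a metric (an assumption for exposition, not necessarily the paper's exact definition) is the ratio of achieved processor utilization to that of an ideal system in which communication costs nothing:

```latex
T = \frac{U_{\text{actual}}}{U_{\text{ideal}}}, \qquad 0 \le T \le 1
```

so that T approaching 1 means the available threads fully hide communication latency.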
{"title":"Latency tolerance: a metric for performance analysis of multithreaded architectures","authors":"S. Nemawarkar, G. Gao","doi":"10.1109/IPPS.1997.580899","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580899","url":null,"abstract":"Multithreaded multiprocessor systems (MMS) have been proposed to tolerate long latencies for communication. This paper provides an analytical framework based on closed queueing networks to quantify and analyze the latency tolerance of multithreaded systems. We introduce a new metric, called the tolerance index, which quantifies the closeness of performance of the system to that of an ideal system. We characterize the latency tolerance with the changes in the architectural and program workload parameters. We show how an analysis of the latency tolerance provides an insight to the performance optimizations of fine grain parallel program workloads.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122619360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A customizable simulator for workstation networks
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580903
Mustafa Uysal, A. Acharya, R. Bennett, J. Saltz
We present a customizable simulator called netsim for high-performance point-to-point workstation networks. It is accurate enough to be used for application-level performance analysis, yet easy to customize for multiple architectures and software configurations. Customization is accomplished without any proprietary information, using only publicly available hardware specifications and information that can be readily determined with a suite of test programs. We customized netsim for two platforms: a 16-node IBM SP-2 with a multistage network and a 10-node DEC Alpha Farm with an ATM switch. We show that netsim successfully models these two architectures with 2-6% error on the SP-2 and less than 10% error on the Alpha Farm for most test cases. It achieves this accuracy at the cost of a 7-36-fold simulation slowdown with respect to the SP-2 and a 3-8-fold slowdown with respect to the Alpha Farm.
{"title":"A customizable simulator for workstation networks","authors":"Mustafa Uysal, A. Acharya, R. Bennett, J. Saltz","doi":"10.1109/IPPS.1997.580903","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580903","url":null,"abstract":"We present a customizable simulator called netsim for high performance point to point workstation networks that is accurate enough to be used for application level performance analysis, yet is easy enough to customize for multiple architectures and software configurations. Customization is accomplished without using any proprietary information, using only publicly available hardware specifications and information that can be readily determined using a suite of test programs. We customized netsim for two platforms: a 16 node IBM SP-2 with a multistage network and a 10 node DEC Alpha Farm with an ATM switch. We show that netsim successfully models these two architectures with a 2-6% error on the SP-2 and less than 10% error on the Alpha Farm for most test cases. It achieves this accuracy at the cost of a 7-36 fold simulation slowdown with respect to the SP-2 and a 3-8 fold slowdown with respect to the Alpha Farm.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115956282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aurora: scoped behaviour for per-context optimized distributed data sharing
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580942
P. Lu
We introduce Aurora, an all-software distributed shared data system based on standard C++. As with related systems, it provides a shared-data abstraction on distributed-memory hardware. An innovation in Aurora is the use of scoped behaviour for per-context data-sharing optimizations, where a context is a portion of source code such as a loop or phase. With scoped behaviour, a new language scope (e.g., nested braces) can be used to optimize the data-sharing behaviour of the selected source code; different scopes and different shared data can be optimized in different ways. Thus, scoped behaviour provides a novel degree of flexibility for incrementally tuning the parallel performance of an application.
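The flavour of scoped behaviour can be sketched with a C++ RAII handle (hypothetical names, not Aurora's actual API): entering a new scope installs an optimized sharing policy for a shared object, and leaving the scope restores the previous one, so each loop or phase can tune sharing independently.

```cpp
#include <iostream>
#include <string>

// Hypothetical sketch of scoped behaviour. A SharedVec normally uses a
// conservative sharing policy; a ScopedPolicy handle switches it to an
// optimized policy for the duration of one language scope.
struct SharedVec {
    std::string policy = "default";  // per-object sharing behaviour
};

class ScopedPolicy {
public:
    ScopedPolicy(SharedVec& v, std::string p) : v_(v), saved_(v.policy) {
        v_.policy = std::move(p);            // active inside the scope
    }
    ~ScopedPolicy() { v_.policy = saved_; }  // restored on scope exit
private:
    SharedVec& v_;
    std::string saved_;
};

int main() {
    SharedVec x;
    {   // new scope: x uses batched updates only for this phase
        ScopedPolicy opt(x, "batched-updates");
        std::cout << x.policy << '\n';  // batched-updates
    }
    std::cout << x.policy << '\n';      // default
}
```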
{"title":"Aurora: scoped behaviour for per-context optimized distributed data sharing","authors":"P. Lu","doi":"10.1109/IPPS.1997.580942","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580942","url":null,"abstract":"We introduce the all-software, standard C++-based Aurora distributed shared data system. As with related systems, it provides a shared data abstraction on distributed memory hardware. An innovation in Aurora is the use of scoped behaviour for per-context data sharing optimizations (i.e., portion of source code, such as a loop or phase). With scoped behaviour a new language scope (e.g., nested braces) can be used to optimize the data sharing behaviour of the selected source code. Different scopes and different shared data can be optimized in different ways. Thus, scoped behaviour provides a novel level of flexibility to incrementally tune the parallel performance of an application.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130914417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A fast scalable universal matrix multiplication algorithm on distributed-memory concurrent computers
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580916
J. Choi
The author presents a fast and scalable matrix multiplication algorithm for distributed-memory concurrent computers whose performance is independent of how data is distributed over processors, called DIMMA (distribution-independent matrix multiplication algorithm). The algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap computation and communication effectively, and it exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine on each processor whether the block size is very small or very large. The algorithm is implemented and compared with SUMMA on the Intel Paragon.
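The LCM block concept, in brief: under a block-cyclic distribution on a Pr-by-Pc processor grid, the ownership pattern repeats every lcm(Pr, Pc) blocks in each dimension, so aggregating that many blocks lets each processor issue one large sequential BLAS call regardless of the distribution block size. A minimal illustration (grid and block size are example values):

```cpp
#include <iostream>
#include <numeric>  // std::lcm (C++17)

// With a Pr x Pc grid and block size nb, the ownership pattern of a
// block-cyclically distributed matrix repeats every lcm(Pr, Pc) blocks,
// i.e. every lcm(Pr, Pc) * nb matrix rows/columns. DIMMA exploits this
// period to aggregate per-processor BLAS calls.
int main() {
    int Pr = 4, Pc = 6, nb = 32;        // example grid and block size
    int lcm_blocks = std::lcm(Pr, Pc);  // 12 blocks
    std::cout << "LCM block spans " << lcm_blocks * nb
              << " rows/cols (" << lcm_blocks << " blocks)\n";
}
```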
{"title":"A fast scalable universal matrix multiplication algorithm on distributed-memory concurrent computers","authors":"J. Choi","doi":"10.1109/IPPS.1997.580916","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580916","url":null,"abstract":"The author presents a fast and scalable matrix multiplication algorithm on distributed memory concurrent computers, whose performance is independent of data distribution on processors, and call it DIMMA (distribution-independent matrix multiplication algorithm). The algorithm is based on two new ideas; it uses a modified pipelined communication scheme to overlap computation and communication effectively, and exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine in each processor when the block size is too small as well as too large. The algorithm is implemented and compared with SUMMA on the Intel Paragon computer.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114571268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterization of deadlocks in interconnection networks
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580852
Sugath Warnakulasuriya, T. Pinkston
Deadlock-free routing algorithms have recently been developed without a full understanding of the frequency and characteristics of actual deadlocks. Using a simulator capable of true deadlock detection, we measure a network's susceptibility to deadlock under various design parameters. The effects of bidirectionality, routing adaptivity, virtual channels, buffer size, and node degree on deadlock formation are studied. In the process, we provide insight into the frequency and characteristics of deadlocks, and into the relationships among routing flexibility, blocked messages, resource dependencies, and the degree of correlation needed to form a deadlock.
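True deadlock detection can be viewed as finding a cycle in a channel wait-for graph, where an edge i -> j means the message holding channel i is blocked waiting for channel j. A simplified sketch of that check (the paper's detector handles more general resource dependencies):

```cpp
#include <functional>
#include <iostream>
#include <vector>

// DFS cycle check over a channel wait-for graph: a Gray node reached
// again along the current path is a back edge, the signature of a set
// of messages each waiting on a resource held by the next.
bool has_cycle(const std::vector<std::vector<int>>& g) {
    enum { White, Gray, Black };
    std::vector<int> color(g.size(), White);
    std::function<bool(int)> dfs = [&](int u) {
        color[u] = Gray;  // on the current DFS path
        for (int v : g[u]) {
            if (color[v] == Gray) return true;  // back edge: cycle
            if (color[v] == White && dfs(v)) return true;
        }
        color[u] = Black;
        return false;
    };
    for (std::size_t u = 0; u < g.size(); ++u)
        if (color[u] == White && dfs(static_cast<int>(u))) return true;
    return false;
}

int main() {
    // channels 0 -> 1 -> 2 -> 0: each blocked message waits on the next
    std::vector<std::vector<int>> waits = {{1}, {2}, {0}};
    std::cout << (has_cycle(waits) ? "deadlock" : "no deadlock") << '\n';
}
```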
{"title":"Characterization of deadlocks in interconnection networks","authors":"Sugath Warnakulasuriya, T. Pinkston","doi":"10.1109/IPPS.1997.580852","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580852","url":null,"abstract":"Deadlock-free routing algorithms have been developed recently without fully understanding the frequency and characteristics of deadlocks. Using a simulator capable of true deadlock detection, we measure a network's susceptibility to deadlock due to various design parameters. The effects of bidirectionality, routing adaptivity, virtual channels, buffer size and node degree on deadlock formation are studied. In the process, we provide insight into the frequency and characteristics of deadlocks and the relationship between routing flexibility blocked messages, resource dependencies and the degree of correlation needed to form deadlock.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114932383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}