Proceedings. 15th Symposium on Computer Architecture and High Performance Computing: latest articles

PM²P: a tool for performance monitoring of message passing applications in COTS PC clusters

Proceedings. 15th Symposium on Computer Architecture and High Performance Computing

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250341

Maya Haridasan, G. H. Pfitscher

The use of clusters of computers as an environment for high performance computing has been shown to be promising. However, the efficient use of such systems still requires advances that make the application development process simpler and more productive. The development of cluster monitoring tools is essential to achieving these advances. We present PM²P, a tool for clusters of personal computers that provides a graphical visualization of the temporal execution of distributed applications that use the MPI standard for message passing. The tool uses an approach involving the parallel port to read the time of events that occur on the different machines of a cluster. It also simulates the execution of task precedence graphs and allocates the tasks of a graph to the machines of a cluster, among other functionalities.

Citations: 2

Exploring memory hierarchy with ArchC

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250315

Pablo Viana, E. Barros, S. Rigo, R. Azevedo, G. Araújo

We present the cache configuration exploration of a programmable system, in order to find the best match between the architecture and a given application. Programmable systems composed of a processor and memories can be rapidly simulated using ArchC, an architecture description language (ADL) based on SystemC. Initially designed to model processor architectures, ArchC was extended to support a more detailed description of the memory subsystem, allowing design space exploration of the whole programmable system. As an example, we show an image processing application running on a SPARC-V8 processor-based architecture whose memory organization was adjusted to minimize cache misses.
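The idea of sweeping cache configurations to minimize misses can be illustrated with a toy sketch. This is not ArchC or SystemC: it is a plain Python model of a direct-mapped cache, and the function names `count_misses` and `explore`, the synthetic image trace, and the candidate configurations are all assumptions made for illustration.

```python
def count_misses(addresses, num_lines, line_size):
    """Count misses of a direct-mapped cache over a byte-address trace."""
    stored_blocks = [None] * num_lines   # block currently held by each line
    misses = 0
    for address in addresses:
        block = address // line_size     # memory block containing the address
        index = block % num_lines        # cache line that block maps to
        if stored_blocks[index] != block:
            stored_blocks[index] = block
            misses += 1
    return misses


def explore(addresses, configurations):
    """Return the (num_lines, line_size) pair with the fewest misses."""
    results = {cfg: count_misses(addresses, *cfg) for cfg in configurations}
    best = min(results, key=results.get)
    return best, results


# Hypothetical workload: two row-wise passes over a 16x16 image of 4-byte pixels.
one_pass = [4 * (row * 16 + col) for row in range(16) for col in range(16)]
trace = one_pass * 2

best, results = explore(trace, [(8, 16), (64, 16), (32, 32)])
print(best, results)
```

In this toy trace the larger line size wins because accesses are purely sequential; a real exploration would replay the application's actual memory trace.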

Citations: 14

Complex branch profiling for dynamic conditional execution

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250318

R. Santos, T. Santos, M. Pilla, P. Navaux, S. Bampi, M. Nemirovsky

Branch predictors are widely used to deal with conditional branches. Despite their high accuracy rates, misprediction penalties are still large in any superscalar pipeline. Dynamic conditional execution (DCE) is an alternative that executes both paths of certain branches, reducing the number of predictions and, therefore, the occurrence of mispredictions. The goal of this work is to analyze the complexity of branch structures and determine the number of branches that can be predicated in DCE and the distribution of mispredictions according to the proposed classification. The proposed complex branch classification extends the classification presented by Klauser [A. Klauser, et al., (1998)]. As a result, we show that an average of 35% of all branches can be predicated in DCE and that around 32% of all mispredictions fall into these branches.

Citations: 6

Performance issues of bandwidth reservations for grid computing

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250324

Lars-Olof Burchard, Hans-Ulrich Heiß, C. Rose

In general, two types of resource reservations in computer networks can be distinguished: immediate reservations, which are made in a just-in-time manner, and advance reservations, which allow resources to be reserved a long time before they are actually used. Advance reservations are especially useful for grid computing but also for a variety of other applications that require network quality-of-service, such as content distribution networks or even mobile clients, which need advance reservation to support handovers for streaming video. With the emergence of the MPLS standard, explicit routing can also be implemented in IP networks, thus overcoming the unpredictable routing behavior that has so far prevented the implementation of advance reservation services. The impact of such advance reservation mechanisms on network performance, with respect to the number of admitted requests and the allocated bandwidth, has so far not been examined in detail. We show that advance reservations can reduce network performance with respect to both metrics. An analysis of the reasons shows a fragmentation of the network resources. In advance reservation environments, additional new services, such as malleable reservations, can be defined and can lead to increased network performance. Four strategies for scheduling malleable reservations are presented and compared. The results of the comparisons show that some strategies increase resource fragmentation and are therefore unsuitable in the considered environment, while others lead to significantly better network performance. Besides discussing the performance issue, we present the software architecture of a management system for advance reservations.

Citations: 67

Three hardware implementations for the binary modular exponentiation: sequential, parallel and systolic

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250344

N. Nedjah, L. M. Mourelle

Modular exponentiation is the cornerstone computation performed in public-key cryptography systems such as the RSA cryptosystem. The operation is time consuming for large operands. We describe the characteristics of three architectures designed to implement modular exponentiation using the fast binary method: the first FPGA prototype has a sequential architecture, the second has a parallel architecture and the third has a systolic array-based architecture. We compare the three prototypes using the classic time × area factor. All three prototypes implement the modular multiplication using the popular Montgomery algorithm.
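As a software illustration of the two ingredients named above, the following minimal sketch combines left-to-right binary (square-and-multiply) exponentiation with Montgomery multiplication. It does not describe the hardware prototypes; the helper names `montgomery_setup`, `montgomery_multiply` and `modexp_binary` are hypothetical, and the sketch assumes an odd modulus and Python 3.8 or later (for `pow(x, -1, m)`).

```python
def montgomery_setup(modulus):
    """Precompute Montgomery constants for an odd modulus."""
    if modulus % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    k = modulus.bit_length()
    r = 1 << k                               # R = 2^k, so R > modulus
    n_prime = (-pow(modulus, -1, r)) % r     # modulus * n_prime = -1 (mod R)
    return k, r, n_prime


def montgomery_multiply(a, b, modulus, k, r, n_prime):
    """Return a * b * R^-1 mod modulus for 0 <= a, b < modulus."""
    t = a * b
    m = ((t & (r - 1)) * n_prime) & (r - 1)  # reductions mod R are bit masks
    u = (t + m * modulus) >> k               # exact division by R
    return u - modulus if u >= modulus else u


def modexp_binary(base, exponent, modulus):
    """Compute base**exponent mod modulus by scanning exponent bits MSB first."""
    k, r, n_prime = montgomery_setup(modulus)
    base_m = (base % modulus) * r % modulus  # base in Montgomery form
    acc = r % modulus                        # 1 in Montgomery form
    for bit in bin(exponent)[2:]:
        acc = montgomery_multiply(acc, acc, modulus, k, r, n_prime)         # square
        if bit == "1":
            acc = montgomery_multiply(acc, base_m, modulus, k, r, n_prime)  # multiply
    return montgomery_multiply(acc, 1, modulus, k, r, n_prime)  # leave Montgomery form


print(modexp_binary(4, 13, 497))
```

The square step runs for every exponent bit and the multiply step only for set bits, so the loop performs at most two Montgomery multiplications per bit.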

Citations: 6

JRastro: a trace agent for debugging multithreaded and distributed Java programs

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250320

Gabriela Jacques-Silva, L. Schnorr, B. Stein

Program tracing is one of the most widely used techniques to debug parallel and distributed programs. In this technique, events are recorded in trace files during the execution of the program for post-mortem visualization of its behavior. We describe JRastro, a trace agent capable of tracing Java programs. The agent was designed to cover three key features: to be transparent to the application developer, to use unmodified Java virtual machines and to observe remote method invocations. By integrating these three features, JRastro differentiates itself from similar tools. Unfortunately, a complete and clean implementation of RMI visualization needs additional support from the Java monitoring system.

Citations: 11

A BSP/CGM algorithm for computing Euler tours in graphs

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250336

E. Cáceres, C. Y. Nasu

We describe a parallel algorithm using the BSP/CGM model (Bulk Synchronous Parallel/Coarse Grained Multicomputer) to obtain Euler tours in graphs. It is based on the PRAM (parallel random access machine) algorithm by Caceres et al. For an input graph of n vertices and m edges, the algorithm requires local computation time of O((m+n)/p), O((m+n)/p) memory and O(log p) communication rounds, where p is the number of processors. To our knowledge there are no other parallel algorithms under the coarse-grained models for Euler tours in graphs. The proposed algorithm is implemented using MPI (message passing interface) and the C language. The parallel program runs on a Beowulf cluster with 66 nodes. The implementation results confirm the theoretical complexity results of the algorithm.
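For readers who want a reference result to compare a parallel run against, a sequential baseline can be sketched as follows. This is Hierholzer's classic sequential algorithm, not the BSP/CGM algorithm described above; the function name `euler_tour` and the edge-list input format are assumptions for illustration.

```python
def euler_tour(edges):
    """Return a closed Euler tour of an undirected graph as a vertex list.

    Returns None when no closed tour covers every edge (some vertex has odd
    degree, or the edges do not form a single connected component).
    """
    if not edges:
        return []
    adjacency = {}
    for edge_id, (u, v) in enumerate(edges):
        adjacency.setdefault(u, []).append((v, edge_id))
        adjacency.setdefault(v, []).append((u, edge_id))
    if any(len(neighbors) % 2 for neighbors in adjacency.values()):
        return None

    used = [False] * len(edges)
    stack = [edges[0][0]]
    tour = []
    while stack:
        vertex = stack[-1]
        neighbors = adjacency[vertex]
        # Discard edges already traversed from their other endpoint.
        while neighbors and used[neighbors[-1][1]]:
            neighbors.pop()
        if neighbors:
            next_vertex, edge_id = neighbors.pop()
            used[edge_id] = True
            stack.append(next_vertex)
        else:
            tour.append(stack.pop())         # dead end: emit vertex, backtrack
    return tour if len(tour) == len(edges) + 1 else None


print(euler_tour([(0, 1), (1, 2), (2, 0)]))
```

Each edge is pushed and popped a constant number of times, so the baseline runs in O(m+n) time, which is the total work that the O((m+n)/p) local computation bound divides among p processors.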

Citations: 0

Dynamic load balancing in PC clusters: an application to a multiphysics model

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250338

Ricardo Vargas Dorneles, Rogério Luís Rizzi, T. A. Diverio, P. Navaux

We describe the use of dynamic load balancing in a PC cluster, applied to a multiphysics model that combines the parallel solution of the three-dimensional (3D) PDEs for flow in shallow water bodies with the parallel solution of the 3D PDEs for scalar transport of substances. The dynamic load balancing is obtained via diffusion algorithms. The numerical mesh is partitioned using the RCB algorithm, in order to minimize communication and balance the load. Parallelism is obtained through Schwarz's additive domain decomposition method (DDM), so that the subproblems are solved concurrently. SPMD is the programming model used, and message passing between processes in the PC cluster is done with the MPICH library.
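The general idea behind diffusion-based load balancing can be sketched in a few lines. This is a generic first-order diffusion scheme simulated in a single process, not necessarily the variant used by the authors and with no MPI involved; the names `diffusion_step` and `balance`, the ring topology, and the coefficient `alpha = 0.25` are assumptions for illustration.

```python
def diffusion_step(loads, neighbors, alpha):
    """One synchronous step: every process trades load with each neighbor."""
    new_loads = list(loads)
    for i, adjacent in neighbors.items():
        for j in adjacent:
            # Load flows from the heavier to the lighter side of each link.
            new_loads[i] += alpha * (loads[j] - loads[i])
    return new_loads


def balance(loads, neighbors, alpha=0.25, tolerance=1e-6, max_steps=10_000):
    """Iterate until the gap between heaviest and lightest process is small."""
    for step in range(max_steps):
        if max(loads) - min(loads) <= tolerance:
            return loads, step
        loads = diffusion_step(loads, neighbors, alpha)
    return loads, max_steps


# Hypothetical setup: four processes in a ring, all work initially on process 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
final_loads, steps = balance([100.0, 0.0, 0.0, 0.0], ring)
print(final_loads, steps)
```

Because the neighbor relation is symmetric, whatever one process gives away its neighbor receives, so the total load is conserved while the differences shrink. Each process needs only its neighbors' loads, which is what makes the scheme attractive on a cluster.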

Citations: 2

Load balancing on stateful clustered Web servers

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250340

George Teodoro, T. Tavares, Bruno Coutinho, Wagner Meira Jr, Dorgival Olavo Guedes Neto

One of the main challenges to the wide use of the Internet is the scalability of the servers, that is, their ability to handle increasing demand. Scalability in stateful servers, which comprise e-commerce and other transaction-oriented servers, is even more difficult, since it is necessary to keep transaction data across requests from the same user. One common strategy for achieving scalability is to employ clustered servers, where the load is distributed among the various servers. However, as a consequence of the workload characteristics and the need to keep data coherent among the servers that compose the cluster, load imbalances arise among servers, reducing the efficiency of the cluster as a whole. We propose and evaluate a strategy for load balancing in stateful clustered servers. Our strategy is based on control theory and achieved significant gains over configurations that do not employ the load balancing strategy, reducing the response time by up to 50% and increasing the throughput by up to 16%.

Citations: 16

The limits of speculative trace reuse on deeply pipelined processors

Pub Date : 2003-11-10 DOI: 10.1109/CAHPC.2003.1250319

M. Pilla, Amarildo T. da Costa, F. França, B. Childers, M. Soffa

Trace reuse improves the performance of processors by skipping the execution of sequences of redundant instructions. However, many reusable traces do not have all of their inputs ready by the time the reuse test is done. For these cases, we developed a new technique called reuse through speculation on traces (RST), where trace inputs may be predicted. We study the limits of RST for modern processors with deep pipelines, as well as the effects of constraining resources on performance. We show that our approach reuses more traces than the nonspeculative trace reuse technique, with speedups of 43% over nonspeculative trace reuse and 57% when memory accesses are reused.

Citations: 12
