
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) - Latest Publications

A comparison of 4X InfiniBand and Quadrics Elan-4 technologies
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392617
R. Brightwell, D. Doerfler, K. Underwood
Quadrics Elan-4 and 4X InfiniBand have comparable performance in terms of peak bandwidth and ping-pong latency. In contrast, the two network architectures differ dramatically in details ranging from signaling technologies to programming interface design to software stacks. Both networks compete in the high performance computing marketplace, and InfiniBand is currently receiving a significant amount of attention, due mostly to its potential cost/performance advantage. This work compares 4X InfiniBand and Quadrics Elan-4 on identical compute hardware using application benchmarks of importance to the DOE community. We use scaling efficiency as the main performance metric, and we also provide a cost analysis for different network configurations. Although our 32-node test platform is relatively small, some scaling issues are evident. In general, the Quadrics hardware scales slightly better on most of the applications tested.
Citations: 26
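The paper above uses scaling efficiency as its main metric. As a quick illustration (not code from the paper; the function name and the choice of a baseline run are our own conventions), scaling efficiency is simply speedup normalized by the increase in processor count:

```python
def scaling_efficiency(t_base, n_base, t_n, n):
    """Parallel scaling efficiency relative to a baseline run.

    Speedup is t_base / t_n; efficiency divides that by the
    factor by which the processor count grew (n / n_base).
    An efficiency of 1.0 means perfect scaling.
    """
    speedup = t_base / t_n
    return speedup / (n / n_base)

# Example: a job takes 100 s on 4 nodes and 30 s on 16 nodes.
eff = scaling_efficiency(100.0, 4, 30.0, 16)
print(round(eff, 3))  # prints 0.833
```

A value below 1.0, as here, is the kind of scaling loss the paper attributes to network behavior at larger node counts.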
Rolls: modifying a standard system installer to support user-customizable cluster frontend appliances
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392641
Greg Bruno, M. Katz, Federico D. Sacerdoti, P. Papadopoulos
The Rocks toolkit uses a graph-based framework to describe the configuration of all node types (termed appliances) that make up a complete cluster. With hundreds of deployed clusters, our turnkey systems approach has proven quite easy to adapt to different hardware and logical node configurations. However, the Rocks architecture and implementation contain a significant asymmetry: the graph definition of every appliance type except the initial frontend can be modified and extended by the end-user before installation, while frontends can be modified only afterward, by hands-on system administration. To address this administrative discontinuity between nodes and frontends, we describe the design and implementation of Rolls. First and foremost, Rolls provide both the architecture and the mechanisms that enable the end-user to incrementally and programmatically modify the graph description for all appliance types. New functionality can be added, and any Rocks-supplied software component can be overwritten or removed, simply by inserting the desired Roll CD(s) at installation time. This symmetric approach to cluster construction has allowed us to shrink the core of the Rocks implementation while increasing flexibility for the end-user. Rolls are optional, automatically configured, cluster-aware software systems. Current add-ons include: scheduling systems (SGE, PBS), grid support (based on the NSF Middleware Initiative), database support (DB2), Condor, integrity checking (Tripwire), and the Intel compiler. Community-specific Rolls can be, and are, developed by groups outside of the Rocks core development group.
Citations: 33
A community faulted-crust model using PYRAMID on cluster platforms
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392656
J. Parker, G. Lyzenga, C. Norton, E. Tisdale, A. Donnellan
Development has boosted the GeoFEST system for simulating the faulted crust from a local desktop research application to a community model deployed on advanced cluster platforms, including Apple G5, Intel P4, SGI Altix 3000, and HP Itanium 2 clusters. GeoFEST uses unstructured tetrahedral meshes to follow details of stress evolution, fault slip, and plastic/elastic processes in quake-prone inhomogeneous regions, like Los Angeles. This makes it ideal for interpreting GPS and radar measurements of deformation. To remake GeoFEST as a high-performance community code, the essential new features are Web accessibility, scalable performance on popular clusters, and parallel adaptive mesh refinement (PAMR). While GeoFEST source is available for free download, a Web portal environment is also supported. Users can work entirely within a Web browser, from problem definition to results animation, using tools like a database of faults, meshing, GeoFEST, and visualization. For scalable deployment, GeoFEST now relies on the PYRAMID library. The direct solver was rewritten as an iterative method, using PYRAMID's support for partitioning. Analysis determined that scaling is most sensitive to the solver communication required at the domain boundaries. Direct pairwise exchange proved successful (linear), while a binary-tree method involving all domains did not. On current Intel clusters with Myrinet, the application has insignificant communication overhead for problems down to ~1000s of elements per processor. Over one million elements run well on 64 processors. Initial tests using PYRAMID for the PAMR (essential for regional simulations) and a strain-energy metric produce quality meshes.
Citations: 2
Master-slave scheduling on heterogeneous star-shaped platforms with limited memory
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392654
Arnaud Legrand, Olivier Beaumont, L. Marchal, Y. Robert
Summary form only given. In this work, we consider the problem of allocating and scheduling a collection of independent, equal-sized tasks on heterogeneous star-shaped platforms. We also address the same problem for divisible tasks. For both cases, we take memory constraints into account. We prove strong NP-completeness results for different objective functions, namely makespan minimization and throughput maximization, on simple star-shaped platforms. We propose an approximation algorithm based on the unconstrained version (with unlimited memory) of the problem. We introduce several heuristics, which are evaluated and compared through extensive simulations. An unexpected conclusion drawn from these experiments is that classical scheduling heuristics that try to greedily minimize the completion time of each task are outperformed by the simple heuristic that consists in assigning the task to the available processor that has the smallest communication time, regardless of computation power (hence a "bandwidth-centric" distribution).
Citations: 1
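The winning heuristic in the abstract above can be sketched as a small event simulation. This is entirely our own sketch (the function name, tie-breaking detail, and cost model are invented): each task goes to the worker that frees up soonest, with ties broken by smallest communication time alone; compute power only advances that worker's clock and never drives the choice.

```python
import heapq

def bandwidth_centric_assign(num_tasks, comm, comp):
    """Greedy "bandwidth-centric" assignment of equal-sized tasks.

    comm[i] / comp[i]: per-task communication / computation time of
    worker i. Returns how many tasks each worker received.
    """
    # heap entries: (time the worker is next free, its comm time, its id);
    # the comm time in slot two breaks ties between simultaneously
    # available workers, ignoring compute power entirely
    heap = [(0.0, comm[i], i) for i in range(len(comm))]
    heapq.heapify(heap)
    counts = [0] * len(comm)
    for _ in range(num_tasks):
        free_at, c, i = heapq.heappop(heap)
        counts[i] += 1
        # the worker is busy for its communication plus compute time
        heapq.heappush(heap, (free_at + c + comp[i], c, i))
    return counts

# Two workers: fast link / slow CPU versus slow link / fast CPU.
print(bandwidth_centric_assign(10, comm=[1.0, 4.0], comp=[3.0, 1.0]))
```

In this toy run the fast-link worker receives more tasks even though its per-task cycle (comm + comp) is shorter only because of its link, which mirrors the "bandwidth-centric" conclusion of the experiments.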
Give your bootstrap the boot: using the operating system to boot the operating system
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392643
R. Minnich
One of the slowest and most annoying aspects of system management is the simple act of rebooting the system. The sysadmin starts from a known state - the OS is running - and hands the computer over to an untrustworthy piece of software. With enough nodes involved, there is a certain chance that the process will fail on one of them. Bootstrapping is well named: it takes the system down to a low level, from which return is uncertain. It would be much better if we could use the known, trusted OS software to manage the boot process. The OS can apply all its power to the problem of locating, verifying, and loading a new OS image. Error checking and feedback can be far more robust. We discuss five systems for Linux and Plan 9 that allow the OS to boot the OS. These systems allow for the complete elimination of the old-fashioned bootstrap.
Citations: 4
Communicating efficiently on cluster based grids with MPICH-VMI
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392598
A. Pant, Hassan Jafri
Emerging infrastructure of computational grids composed of clusters-of-clusters (CoC) interlinked through high-throughput channels promises unprecedented raw compute power for terascale applications. Projects such as the NSF TeraGrid and EU DataGrid deploy CoCs across multiple geographical sites, providing tens of teraflops. Efficient scaling of terascale applications on these grids poses a challenge due to the heterogeneous nature of the resources (operating systems and SANs) present at each site, which makes interoperability among multiple clusters difficult. In addition, due to the enormous disparity in latency and throughput between channels within a SAN and those interlinking multiple clusters, these CoC grids contain deep communication hierarchies that prohibit efficient scaling of tightly-coupled applications. We present the design of a grid-enabled MPI called MPICH-VMI for running terascale applications over CoC-based computational grids. MPICH-VMI is based on the MPICH implementation of the MPI 1.1 standard and utilizes a middleware messaging library called the virtual machine interface (VMI). VMI enables MPICH-VMI to communicate over the heterogeneous networks common in CoC-based grids. MPICH-VMI also features novel optimizations for hiding the communication hierarchies present in CoC-based grids. We also present some preliminary results with MPICH-VMI running on the TeraGrid for MPI benchmarks and applications.
Citations: 39
Performance analysis tools for large-scale Linux clusters
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392635
Z. Cvetanovic
As cluster computer environments increase in size and complexity, it is becoming more challenging to analyze and identify factors that limit performance and scalability. Easy-to-use tools that help identify such bottlenecks are crucial for tuning applications and configuring systems for best performance. We present a collection of visualization tools, which allow users to monitor load on all cluster components simultaneously, with negligible overhead, and no changes in the application. We include examples where the tools have been used to identify bottlenecks within a cluster and improve performance. We provide several examples of application profiles gathered using the tools and outline the methodology for projecting performance of future cluster platforms.
Citations: 1
Fast broadcast by the divide-and-conquer algorithm
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392653
Dongyoung Kim, Dongseung Kim
Collective communication functions, including the broadcast, in cluster computers usually take O(m log P) time to propagate a size-m message to P processors. We have devised a new O(m) broadcast algorithm, independent of the number of processors involved, using a divide-and-conquer approach. Details are given below.
Citations: 3
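The abstract does not spell out the algorithm, but one standard way a broadcast's cost can be made (nearly) independent of P is to divide the message, scatter one piece per processor, and then all-gather the pieces around a ring. The toy functional simulation below is our own sketch of that idea, not necessarily the authors' scheme:

```python
def chunk(msg, p):
    """Split msg into p nearly equal contiguous pieces."""
    k, r = divmod(len(msg), p)
    out, start = [], 0
    for i in range(p):
        size = k + (1 if i < r else 0)
        out.append(msg[start:start + size])
        start += size
    return out

def broadcast(msg, p):
    """Scatter + ring all-gather broadcast over p simulated processors.

    The root scatters one chunk per processor (data moved: ~m), then a
    ring all-gather circulates chunks for p-1 steps (~m more per
    processor), so per-processor cost stays O(m) regardless of p.
    Returns each processor's reassembled copy of the message.
    """
    chunks = chunk(msg, p)
    have = [{i} for i in range(p)]       # chunk ids held after scatter
    for s in range(p - 1):               # ring all-gather steps
        for i in range(p):
            # processor i receives from its left neighbour the chunk
            # that the neighbour itself obtained s steps earlier
            have[i].add((i - 1 - s) % p)
    return ["".join(chunks[j] for j in sorted(h)) for h in have]

print(broadcast("abcdefgh", 3))  # three identical full copies
```

Contrast this with a binomial-tree broadcast, where the whole size-m message traverses log P hops, giving the O(m log P) cost the abstract cites for conventional implementations.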
Implementing parallel conjugate gradient on the EARTH multithreaded architecture
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392645
Fei Chen, K. B. Theobald, G. Gao
Conjugate gradient (CG) is one of the most popular iterative approaches to solving large sparse linear systems of equations. This work reports a parallel implementation of CG on clusters with EARTH multithreaded runtime support. Interphase and intraphase communication costs are balanced using a two-dimensional blocking method, minimizing overall communication costs. EARTH's adaptive, event-driven multithreaded execution model gives additional opportunities to overlap communication and computation to achieve even better scalability. Experiments on a large Beowulf cluster with gigabit Ethernet show notable improvements over other parallel CG implementations. For example, with the NAS CG benchmark problem size Class C, our implementation achieved a speedup of 41 on a 64-node cluster, compared to 13 for the MPI-based NAS version. The results demonstrate that the combination of the two-dimensional blocking method and the EARTH architectural runtime support helps to compensate for the low communication bandwidth common to most clusters.
引用次数: 13
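For reference, the conjugate gradient iteration that the paper above parallelizes can be sketched sequentially as follows. This is a minimal dense-matrix sketch in plain Python; the paper's actual setting — large sparse systems, two-dimensional blocked distribution, and EARTH multithreading — is deliberately not modeled here.

```python
# Minimal sequential conjugate gradient for a symmetric positive-definite
# system A x = b, using plain Python lists as dense vectors/matrices.

def matvec(A, x):
    # Dense matrix-vector product.
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n
    r = b[:]                 # residual r = b - A x, with x = 0 initially
    p = r[:]                 # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)            # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if rr_new < tol * tol:
            break
        beta = rr_new / rr                 # conjugacy coefficient
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x
```

In a distributed implementation, the matrix-vector product and the two inner products are the communication points of each iteration; the two-dimensional blocking described in the abstract balances the row-wise and column-wise exchanges these require.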
Parallel competitive learning algorithm for fast codebook design on partitioned space 分区空间上快速码本设计的并行竞争学习算法
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392644
S. Momose, K. Sano, K. Suzuki, Tadao Nakamura
Vector quantization (VQ) is an attractive technique for lossy data compression, which is a key technology for data storage and/or transfer. So far, various competitive learning (CL) algorithms have been proposed to design optimal codebooks presenting quantization with minimized errors. However, their practical use has been limited for large scale problems, due to the computational complexity of competitive learning. This work presents a parallel competitive learning algorithm for fast codebook design based on space partitioning. The algorithm partitions input-vector space into some subspaces, and independently designs corresponding subcodebooks for these subspaces with computational complexity reduced. Independent processing on different subspaces can be processed in parallel without synchronization overhead, resulting in high scalability. We perform experiments of parallel codebook design on a commodity PC cluster with 8 nodes. Experimental results show that the high speedup of the codebook design is obtained without increase of quantization errors.
矢量量化(VQ)是一种有吸引力的有损数据压缩技术,是数据存储和传输的关键技术。到目前为止,已经提出了各种竞争学习(CL)算法来设计具有最小化误差的量化的最优码本。然而,由于竞争学习的计算复杂性,它们的实际应用在大规模问题上受到限制。本文提出了一种基于空间划分的并行竞争学习算法,用于快速码本设计。该算法将输入向量空间划分为若干子空间,并为这些子空间独立设计相应的子码本,降低了计算复杂度。不同子空间上的独立处理可以并行处理,没有同步开销,从而具有高可伸缩性。我们在一个8节点的商用PC集群上进行了并行码本设计的实验。实验结果表明,在不增加量化误差的情况下,该码本设计获得了较高的加速。
{"title":"Parallel competitive learning algorithm for fast codebook design on partitioned space","authors":"S. Momose, K. Sano, K. Suzuki, Tadao Nakamura","doi":"10.1109/CLUSTR.2004.1392644","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392644","url":null,"abstract":"Vector quantization (VQ) is an attractive technique for lossy data compression, which is a key technology for data storage and/or transfer. So far, various competitive learning (CL) algorithms have been proposed to design optimal codebooks presenting quantization with minimized errors. However, their practical use has been limited for large scale problems, due to the computational complexity of competitive learning. This work presents a parallel competitive learning algorithm for fast code-book design based on space partitioning. The algorithm partitions input-vector space into some subspaces, and independently designs corresponding subcodebooks for these subspaces with computational complexity reduced. Independent processing on different subspaces can be processed in parallel without synchronization overhead, resulting in high scalability. We perform experiments of parallel codebook design on a commodity PC cluster with 8 nodes. Experimental results show that the high speedup of the codebook design is obtained without increase of quantization errors.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121065095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
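The competitive learning step that this paper parallelizes can be sketched as the classic sequential winner-take-all update below. This is a plain-Python sketch: the naive first-k initialization and the absence of input-space partitioning are simplifications — the paper's contribution is precisely the parallel, partitioned variant.

```python
# Minimal winner-take-all competitive learning for codebook design.

def nearest(codebook, x):
    # Index of the codeword closest to x (squared Euclidean distance).
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))

def train_codebook(data, k, epochs=20, lr=0.1):
    # Naive initialization: copy the first k input vectors.
    codebook = [list(v) for v in data[:k]]
    for _ in range(epochs):
        for x in data:
            w = nearest(codebook, x)            # competition: pick the winner
            codebook[w] = [c + lr * (v - c)     # move only the winner toward x
                           for c, v in zip(codebook[w], x)]
    return codebook

def distortion(codebook, data):
    # Mean squared quantization error of the data under the codebook.
    return sum(sum((c - v) ** 2
                   for c, v in zip(codebook[nearest(codebook, x)], x))
               for x in data) / len(data)
```

In the paper's scheme, the input-vector space would first be partitioned into subspaces, a routine like `train_codebook` would run on each partition independently (hence in parallel without synchronization), and the resulting subcodebooks together form the full codebook.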
期刊
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)