Latest publications: 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)

LOTS: a software DSM supporting large object space
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392620
B. Cheung, Cho-Li Wang, F. Lau
Software DSM provides good programmability for cluster computing, but its performance and its limited shared memory space for large applications hinder its popularity. This paper introduces LOTS, a C++ runtime library supporting a large shared object space. With its dynamic memory mapping mechanism, LOTS maps objects lazily from the local disk into virtual memory as they are accessed, leaving only a trace of control information for each object in the local process space. To our knowledge, LOTS is the first pure runtime software DSM supporting a shared object space larger than the local process space. Our testing shows that LOTS can utilize all the free hard disk space available to support hundreds of gigabytes of shared objects with a small overhead. The scope consistency memory model and a mixed coherence protocol allow LOTS to achieve better scalability with respect to problem size and cluster size.
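The dynamic mapping idea (keeping only a small control record per object in process memory and faulting the payload in from local disk on first access) can be sketched in a few lines. This is a toy illustration under assumed names; it is not the LOTS API:

```python
import os
import pickle
import tempfile

class LazyObject:
    """Hypothetical sketch of lazy disk-to-memory mapping: only a small
    control record (file path + cache slot) lives in process memory; the
    object payload stays on local disk until first access."""
    def __init__(self, value):
        fd, self._path = tempfile.mkstemp(suffix=".obj")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(value, f)      # spill the payload to local disk
        self._cached = None            # not mapped into memory yet

    def get(self):
        if self._cached is None:       # map lazily on first access
            with open(self._path, "rb") as f:
                self._cached = pickle.load(f)
        return self._cached

    def evict(self):
        self._cached = None            # drop the mapping, keep the disk copy

big = LazyObject(list(range(10000)))
assert big._cached is None             # nothing mapped before access
assert big.get()[-1] == 9999           # faulted in from disk
big.evict()
assert big.get()[0] == 0               # can be re-mapped after eviction
```

A real DSM would do this at page granularity with `mmap` and coherence metadata; the sketch only shows why the resident footprint can be far smaller than the total object space.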
Citations: 8
MPI tuning with Intel® Trace Analyzer and Intel® Trace Collector
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392595
R. Asbury, M. Wrinn
Intel® Cluster Tools assist developers of distributed parallel software in analyzing and optimizing applications on clusters. This tutorial uses a combination of lecture, demo, and (primarily) lab exercises with these tools to introduce event-based tracing techniques for MPI applications. The tools used in this tutorial were formerly marketed as Vampir and Vampirtrace.
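The event-based tracing technique the tutorial teaches reduces to a simple pattern: record timestamped enter/exit events around instrumented calls, then derive durations from event pairs. The sketch below is a generic illustration of that pattern, not the Intel Trace Collector API:

```python
import time
import functools

TRACE = []  # (event, function, timestamp) records, like a collector's event log

def traced(fn):
    """Toy event-based tracer: append an enter event before the call and
    an exit event after it, so post-processing can reconstruct timelines."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        TRACE.append(("enter", fn.__name__, time.perf_counter()))
        try:
            return fn(*args, **kwargs)
        finally:
            TRACE.append(("exit", fn.__name__, time.perf_counter()))
    return wrapper

@traced
def exchange(buf):
    time.sleep(0.005)          # stand-in for a communication call
    return len(buf)

exchange(b"abc")
# pair up enter/exit events to get per-call durations
durations = {name: t1 - t0
             for (_, name, t0), (_, _, t1) in zip(TRACE[::2], TRACE[1::2])}
```

Real trace collectors intercept the MPI profiling interface (PMPI) instead of using decorators, and write events to a binary log for later visualization.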
Citations: 4
A community faulted-crust model using PYRAMID on cluster platforms
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392656
J. Parker, G. Lyzenga, C. Norton, E. Tisdale, A. Donnellan
Development has boosted the GeoFEST system for simulating the faulted crust from a local desktop research application to a community model deployed on advanced cluster platforms, including Apple G5, Intel P4, SGI Altix 3000, and HP Itanium 2 clusters. GeoFEST uses unstructured tetrahedral meshes to follow details of stress evolution, fault slip, and plastic/elastic processes in quake-prone inhomogeneous regions, like Los Angeles. This makes it ideal for interpreting GPS and radar measurements of deformation. To remake GeoFEST as a high-performance community code, the essential new features are Web accessibility, scalable performance on popular clusters, and parallel adaptive mesh refinement (PAMR). While GeoFEST source is available for free download, a Web portal environment is also supported. Users can work entirely within a Web browser, from problem definition to results animation, using tools like a database of faults, meshing, GeoFEST, and visualization. For scalable deployment, GeoFEST now relies on the PYRAMID library. The direct solver was rewritten as an iterative method, using PYRAMID's support for partitioning. Analysis determined that scaling is most sensitive to the solver communication required at the domain boundaries. Direct pairwise exchange proved successful (linear), while a binary tree method involving all domains was not. On current Intel clusters with Myrinet, the application has insignificant communication overhead for problems down to ~1,000s of elements per processor. Over one million elements run well on 64 processors. Initial tests using PYRAMID for the PAMR (essential for regional simulations) and a strain-energy metric produce quality meshes.
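The direct pairwise exchange that scaled linearly can be pictured for a 1-D chain of subdomains: each subdomain swaps only its edge values with its immediate neighbours, so per-domain traffic stays constant as the domain count grows. The following is a simplified illustration of that pattern, not GeoFEST code:

```python
def halo_exchange(domains):
    """Direct pairwise boundary exchange on a 1-D chain of subdomains:
    each domain receives one ghost value from each immediate neighbour.
    Per-domain message volume is constant regardless of chain length."""
    n = len(domains)
    left_ghost = [None] * n    # value received from the left neighbour
    right_ghost = [None] * n   # value received from the right neighbour
    for i in range(n):
        if i > 0:
            left_ghost[i] = domains[i - 1][-1]   # neighbour's right edge
        if i < n - 1:
            right_ghost[i] = domains[i + 1][0]   # neighbour's left edge
    return left_ghost, right_ghost

lg, rg = halo_exchange([[1, 2], [3, 4], [5, 6]])
```

By contrast, a tree method that involves all domains in every exchange couples its cost to the total domain count, which matches the abstract's observation that it did not scale.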
Citations: 2
A comparison of 4X InfiniBand and Quadrics Elan-4 technologies
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392617
R. Brightwell, D. Doerfler, K. Underwood
Quadrics Elan-4 and 4X InfiniBand have comparable performance in terms of peak bandwidth and ping-pong latency. In contrast, the two network architectures differ dramatically in details ranging from signaling technologies to programming interface design to software stacks. Both networks compete in the high performance computing marketplace, and InfiniBand is currently receiving a significant amount of attention, due mostly to its potential cost/performance advantage. This work compares 4X InfiniBand and Quadrics Elan-4 on identical compute hardware using application benchmarks of importance to the DOE community. We use scaling efficiency as the main performance metric, and we also provide a cost analysis for different network configurations. Although our 32-node test platform is relatively small, some scaling issues are evident. In general, the Quadrics hardware scales slightly better on most of the applications tested.
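Scaling efficiency, the main metric here, is conventionally defined as speedup divided by process count, with 1.0 being ideal. A minimal sketch with hypothetical timings:

```python
def scaling_efficiency(t1, tp, p):
    """Strong-scaling efficiency for a fixed-size problem:
    speedup (t1 / tp) divided by the process count p."""
    return (t1 / tp) / p

# hypothetical wall-clock times for one benchmark at three scales
runs = {1: 100.0, 8: 14.0, 32: 4.5}
eff = {p: scaling_efficiency(runs[1], t, p) for p, t in runs.items()}
# efficiency typically decays as p grows; comparing that decay across
# two interconnects on identical nodes isolates the network's effect
```

Plotting `eff` for the two networks over the same application set is exactly the kind of comparison the paper reports.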
Citations: 26
Communicating efficiently on cluster based grids with MPICH-VMI
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392598
A. Pant, Hassan Jafri
Emerging infrastructure of computational grids composed of clusters-of-clusters (CoC) interlinked through high-throughput channels promises unprecedented raw compute power for terascale applications. Projects such as the NSF TeraGrid and EU DataGrid deploy CoCs across multiple geographical sites, providing tens of teraflops. Efficient scaling of terascale applications on these grids poses a challenge due to the heterogeneous nature of the resources (operating systems and SANs) present at each site, which makes interoperability among multiple clusters difficult. In addition, due to the enormous disparity in latency and throughput between the channels within a SAN and those interlinking multiple clusters, these CoC grids contain deep communication hierarchies that prohibit efficient scaling of tightly coupled applications. We present a design of a grid-enabled MPI called MPICH-VMI for running terascale applications over CoC-based computational grids. MPICH-VMI is based on the MPICH implementation of the MPI 1.1 standard and utilizes a middleware messaging library called the virtual machine interface (VMI). VMI enables MPICH-VMI to communicate over the heterogeneous networks common in CoC-based grids. MPICH-VMI also features novel optimizations for hiding the communication hierarchies present in CoC-based grids. We also present some preliminary results with MPICH-VMI running on the TeraGrid for MPI benchmarks and applications.
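One way to picture why hiding the communication hierarchy pays off is to count wide-area messages under a naive scheme versus a per-site proxy scheme. This toy model is illustrative only; it is not the actual MPICH-VMI mechanism:

```python
from collections import defaultdict

def intersite_messages(rank_site, use_proxy=False):
    """Toy count of messages crossing the slow inter-site links when
    every rank must reach every remote site. Naive: each rank sends one
    wide-area message per remote site. Proxy: one designated rank per
    site forwards on behalf of its peers, so only proxies cross the
    inter-site links."""
    sites = defaultdict(list)
    for rank, site in rank_site.items():
        sites[site].append(rank)
    nsites = len(sites)
    if use_proxy:
        return nsites * (nsites - 1)
    return sum(len(members) * (nsites - 1) for members in sites.values())

layout = {rank: rank // 8 for rank in range(32)}  # 32 ranks over 4 sites
```

For this layout the naive scheme sends 96 wide-area messages against 12 for the proxy scheme, which is the flavour of saving a hierarchy-aware MPI can extract from the deep latency gap the abstract describes.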
Citations: 39
Give your bootstrap the boot: using the operating system to boot the operating system
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392643
R. Minnich
One of the slowest and most annoying aspects of system management is the simple act of rebooting the system. The sysadmin starts from a known state (the OS is running) and hands the computer over to an untrustworthy piece of software. With enough nodes involved, there is a certain chance that the process will fail on one of them. Bootstrapping is well named: it takes the system down to a low level, from which return is uncertain. It would be much better if we could use the known, trusted OS software to manage the boot process. The OS can apply all its power to the problem of locating, verifying, and loading a new OS image. Error checking and feedback can be far more robust. We discuss five systems for Linux and Plan 9 that allow the OS to boot the OS. These systems allow for the complete elimination of the old-fashioned bootstrap.
Citations: 4
Performance analysis tools for large-scale Linux clusters
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392635
Z. Cvetanovic
As cluster computer environments increase in size and complexity, it is becoming more challenging to analyze and identify factors that limit performance and scalability. Easy-to-use tools that help identify such bottlenecks are crucial for tuning applications and configuring systems for best performance. We present a collection of visualization tools that allow users to monitor load on all cluster components simultaneously, with negligible overhead and no changes to the application. We include examples where the tools have been used to identify bottlenecks within a cluster and improve performance. We provide several examples of application profiles gathered using the tools and outline the methodology for projecting performance of future cluster platforms.
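The simultaneous-monitoring idea can be sketched as a single polling sweep over per-node probes; the probe below is a stand-in for a real per-node agent, and all names are hypothetical:

```python
import random

def sample_cluster(nodes, probe):
    """One monitoring sweep: poll a load metric from every node and
    return a snapshot suitable for tabulation or visualization."""
    return {node: probe(node) for node in nodes}

# simulated probe: a real tool would query an agent on each node
rng = random.Random(42)
snapshot = sample_cluster([f"node{i:02d}" for i in range(4)],
                          probe=lambda node: round(rng.uniform(0.0, 2.0), 2))
overloaded = [n for n, load in snapshot.items() if load > 1.5]
```

Keeping the sweep cheap and application-transparent is what makes such tools usable on production clusters, per the abstract's "negligible overhead, no changes" claim.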
Citations: 1
Fast broadcast by the divide-and-conquer algorithm
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392653
Dongyoung Kim, Dongseung Kim
Collective communication functions, including broadcast, in cluster computers usually take O(m log P) time to propagate a size-m message to P processors. We have devised a new O(m) broadcast algorithm, independent of the number of processors involved, using a divide-and-conquer algorithm. Details are given below.
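The abstract does not spell out its exact scheme, but the standard way to reach an O(m) bandwidth term is to split the message, scatter the chunks, then allgather them. The cost sketch below compares that standard scatter+allgather model against a plain binomial tree (an illustrative model, not the paper's algorithm):

```python
import math

def tree_broadcast_cost(m, p):
    """Binomial tree: the full size-m message traverses each of the
    ceil(log2(p)) tree levels, so the bandwidth term is O(m log p)."""
    return m * math.ceil(math.log2(p))

def dc_broadcast_cost(m, p):
    """Divide-and-conquer (scatter + allgather): the message is split
    into p chunks, scattered (m*(p-1)/p moved per process), then
    allgathered (the same again), so the bandwidth term is ~2m
    regardless of p."""
    return 2 * m * (p - 1) / p
```

For m = 1000 and p = 64 the tree model costs 6000 units against under 2000 for the divide-and-conquer model, and the gap widens with p, matching the claimed independence from processor count.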
Citations: 3
Implementing parallel conjugate gradient on the EARTH multithreaded architecture
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392645
Fei Chen, K. B. Theobald, G. Gao
Conjugate gradient (CG) is one of the most popular iterative approaches to solving large sparse linear systems of equations. This work reports a parallel implementation of CG on clusters with EARTH multithreaded runtime support. Interphase and intraphase communication costs are balanced using a two-dimensional blocking method, minimizing overall communication costs. EARTH's adaptive, event-driven multithreaded execution model gives additional opportunities to overlap communication and computation to achieve even better scalability. Experiments on a large Beowulf cluster with gigabit Ethernet show notable improvements over other parallel CG implementations. For example, with the NAS CG benchmark problem size Class C, our implementation achieved a speedup of 41 on a 64-node cluster, compared to 13 for the MPI-based NAS version. The results demonstrate that the combination of the two-dimensional blocking method and the EARTH architectural runtime support helps to compensate for the low communications bandwidth common to most clusters.
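For reference, the sequential method being parallelized is textbook conjugate gradient. A minimal pure-Python version on a 1-D Laplacian test system (not the paper's EARTH implementation; in the parallel version the matrix-vector product and dot products are what get blocked and distributed):

```python
def cg(matvec, b, tol=1e-10, max_iter=200):
    """Textbook conjugate gradient for A x = b with A symmetric positive
    definite; matvec computes the (sparse) matrix-vector product."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                  # residual b - A*0
    p = list(r)                  # initial search direction
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def lap(v):
    """1-D Laplacian tridiag(-1, 2, -1): a typical sparse SPD test matrix."""
    n = len(v)
    return [2 * v[i] - (v[i - 1] if i else 0) - (v[i + 1] if i < n - 1 else 0)
            for i in range(n)]

x = cg(lap, [1.0] * 8)
```

The dot products (`rs`, the alpha denominator) are the global reductions whose communication cost dominates at scale, which is what the paper's 2-D blocking and communication/computation overlap target.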
Citations: 13
Parallel competitive learning algorithm for fast codebook design on partitioned space
Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392644
S. Momose, K. Sano, K. Suzuki, Tadao Nakamura
Vector quantization (VQ) is an attractive technique for lossy data compression, which is a key technology for data storage and/or transfer. So far, various competitive learning (CL) algorithms have been proposed to design optimal codebooks that quantize with minimized errors. However, their practical use has been limited for large-scale problems due to the computational complexity of competitive learning. This work presents a parallel competitive learning algorithm for fast codebook design based on space partitioning. The algorithm partitions the input-vector space into subspaces and independently designs corresponding subcodebooks for these subspaces, reducing computational complexity. Independent processing on different subspaces can proceed in parallel without synchronization overhead, resulting in high scalability. We perform experiments of parallel codebook design on a commodity PC cluster with 8 nodes. Experimental results show that a high speedup of codebook design is obtained without an increase in quantization errors.
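The space-partitioning idea can be sketched directly: split the training vectors into subspaces and design each subcodebook with plain winner-take-all competitive learning; the subcodebook designs are independent and could run in parallel. This is hypothetical toy code (partitioning by sign of the first coordinate), not the paper's algorithm:

```python
import random

def competitive_learning(vectors, k, steps=2000, lr=0.05, seed=0):
    """Plain winner-take-all competitive learning for one partition:
    the winning codeword moves a little toward each training vector."""
    rng = random.Random(seed)
    code = [list(rng.choice(vectors)) for _ in range(k)]
    for _ in range(steps):
        v = rng.choice(vectors)
        win = min(range(k),
                  key=lambda i: sum((c - a) ** 2 for c, a in zip(code[i], v)))
        code[win] = [c + lr * (a - c) for c, a in zip(code[win], v)]
    return code

def partitioned_codebook(vectors, k_per_part):
    """Split input space into subspaces (here: sign of first coordinate)
    and design each subcodebook independently -- no synchronization is
    needed between parts, so they parallelize trivially."""
    parts = [[v for v in vectors if v[0] < 0],
             [v for v in vectors if v[0] >= 0]]
    return [competitive_learning(p, k_per_part) for p in parts if p]

data = ([(-1.0 + 0.1 * i, 0.0) for i in range(5)] +
        [(1.0 + 0.1 * i, 0.0) for i in range(5)])
books = partitioned_codebook(data, 1)
```

Because each subcodebook only ever sees vectors from its own subspace, the per-partition training cost shrinks with the partition size, which is where the reported speedup comes from.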
Citations: 3