
Latest publications from the 2008 IEEE International Conference on Cluster Computing

Empirical-based probabilistic upper bounds for urgent computing applications
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663793
N. Trebon, P. Beckman
Scientific simulation and modeling often aid in making critical decisions in fields as diverse as city planning, severe weather prediction, and influenza modeling. In some of these situations the computations operate under strict deadlines, after which the results may have very little value. In these cases of urgent computing, it is imperative that the computations begin execution as quickly as possible. The Special PRiority and Urgent Computing Environment (SPRUCE) is a framework designed to enable these high-priority computations to quickly access computational grid resources through elevated batch-queue priority. However, participating resources are allowed to decide locally how to respond to urgent requests: some may offer next-to-run status, while others may preempt currently executing jobs to clear off the necessary nodes. The user is still faced with the problem of resource selection - namely, which resource (and corresponding urgent-computing policy) provides the best probability of meeting a given deadline? This paper introduces a set of methodologies and heuristics aimed at generating an empirical-based probabilistic upper bound on the total turnaround time of an urgent computation. These upper bounds can then be used to guide the user in selecting a resource with greater confidence that the deadline will be met.
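The bound-construction idea can be illustrated with a minimal sketch (the function name, the split into queue-wait and run phases, and the sample data below are hypothetical illustrations, not the paper's actual method): take conservative empirical quantiles of each phase's observed history and sum them into an upper bound on total turnaround time.

```python
import math

def empirical_upper_bound(samples, quantile=0.95):
    """Empirical upper bound: the value at or above the given quantile
    of the observed samples (hypothetical helper)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, math.ceil(quantile * len(ordered)) - 1)
    return ordered[idx]

# Hypothetical per-phase history (seconds) for one candidate resource.
queue_waits = [30, 45, 60, 120, 90, 40, 75, 200, 55, 65]
run_times = [600, 640, 610, 700, 590, 620, 660, 605, 615, 630]

# Conservative total bound: sum the per-phase upper bounds.
bound = empirical_upper_bound(queue_waits) + empirical_upper_bound(run_times)
deadline = 1000
print(f"bound = {bound}s, meets deadline: {bound <= deadline}")
```

A user comparing resources would compute such a bound per resource (and per urgent-computing policy) and pick the one whose bound falls within the deadline.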
Cited by: 4
Scalable MPI design over InfiniBand using eXtended Reliable Connection
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663773
Matthew J. Koop, J. K. Sridhar, D. Panda
A significant component of a high-performance cluster is the compute-node interconnect. InfiniBand is one such interconnect, enjoying wide success due to its low latency (1.0-3.0 μs), high bandwidth, and other features. The Message Passing Interface (MPI) is the dominant programming model for parallel scientific applications, so the MPI library and the interconnect play a significant role in scalability. These clusters continue to scale to ever-increasing sizes, making this role ever more important. As an example, the "Ranger" system at the Texas Advanced Computing Center (TACC) includes over 60,000 cores with nearly 4,000 InfiniBand ports. Previous work has shown that memory usage for connections alone, when using the Reliable Connection (RC) transport of InfiniBand, can reach hundreds of megabytes per process at that scale. To address these scalability problems a new InfiniBand transport, eXtended Reliable Connection (XRC), has been introduced. In this paper we describe XRC and design MPI over this new transport. We describe the design choices that must be made as well as the optimizations that XRC allows. We implement our designs and evaluate them on an InfiniBand cluster against RC-based designs. Memory scalability, in terms of both connection memory and memory efficiency for communication buffers, is evaluated for all configurations. The connection-memory evaluation shows a potential 100-fold improvement over a similarly configured RC-based design. Evaluation using NAMD shows a 10% performance improvement for our XRC-based prototype on the jac2000 benchmark.
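The connection-memory argument can be sketched with a toy model (the per-connection cost and core count below are assumed for illustration only, not measured values from the paper): under RC every process holds a connection to every other process, while under XRC roughly one connection per remote node suffices, so per-process connection memory shrinks by about the number of cores per node.

```python
def rc_memory_mb(nodes, cores_per_node, kb_per_conn=68):
    # RC: each process keeps a connection to every other process.
    procs = nodes * cores_per_node
    return (procs - 1) * kb_per_conn / 1024

def xrc_memory_mb(nodes, cores_per_node, kb_per_conn=68):
    # XRC: roughly one connection per remote node instead of per process.
    return (nodes - 1) * kb_per_conn / 1024

# Ranger-like scale from the abstract: ~4000 nodes; 16 cores/node assumed.
rc = rc_memory_mb(4000, 16)
xrc = xrc_memory_mb(4000, 16)
print(f"RC: {rc:.0f} MB/process, XRC: {xrc:.0f} MB/process, "
      f"ratio: {rc / xrc:.1f}x")
```

The improvement ratio in this model tracks the cores-per-node count; larger reported gains would reflect configuration details beyond this sketch.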
Cited by: 36
A dynamic programming approach to optimizing the blocking strategy for the Householder QR decomposition
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663801
Takeshi Fukaya, Yusaku Yamamoto, Shaoliang Zhang
In this paper, we present a new approach to optimizing the blocking strategy for the Householder QR decomposition. In high-performance implementations of the Householder QR algorithm, it is common to use a blocking technique for efficient use of the cache memory. There are several well-known blocking strategies, such as fixed-size blocking and recursive blocking, and usually their parameters, such as the block size and the recursion level, are tuned according to the target machine and the problem size. However, strategies generated with this kind of parameter optimization constitute only a small fraction of all possible blocking strategies. Given the complex performance characteristics of modern microprocessors, non-standard strategies may prove effective on some machines. Considering this situation, we first propose a new universal model that can express a far larger class of blocking strategies than has been considered so far. Next, we give an algorithm that finds a near-optimal strategy from this class using dynamic programming. As a result of this approach, we found an effective blocking strategy that has never been reported. Performance evaluation on the Opteron and Core2 processors shows that our strategy achieves about a 1.2x speedup over recursive blocking when computing the QR decomposition of a 6000 x 6000 matrix.
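The dynamic-programming recurrence behind such an optimizer can be sketched as follows (the cost model below is a made-up stand-in for measured block performance, not the paper's model): the best cost of factoring n remaining columns is the minimum, over the size b of the first block, of that block's cost plus the best cost of the rest.

```python
from functools import lru_cache

def block_cost(b, n, overhead=0.5):
    # Toy cost model (assumed): fixed per-block overhead plus work
    # proportional to block width b and remaining panel width n.
    return overhead + b * n * 0.01

@lru_cache(maxsize=None)
def best_cost(n):
    """Minimum modeled cost and block-size plan for n remaining columns,
    via the DP recurrence best(n) = min_b [block_cost(b, n) + best(n - b)]."""
    if n == 0:
        return 0.0, ()
    best = None
    for b in range(1, n + 1):
        tail_cost, tail_plan = best_cost(n - b)
        total = block_cost(b, n) + tail_cost
        if best is None or total < best[0]:
            best = (total, (b,) + tail_plan)
    return best

cost, plan = best_cost(64)
print(f"modeled cost {cost:.2f} with block sizes {plan}")
```

With real, measured block costs in place of `block_cost`, the same recurrence searches the whole class of variable-block strategies rather than a fixed parameter grid.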
Cited by: 6
RI2N: High-bandwidth and fault-tolerant network with multi-link Ethernet for PC clusters
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663781
Shin'ichi Miura, Takayuki Okamoto, T. Boku, T. Hanawa, M. Sato
Although recent high-end interconnection network devices and switches provide a high performance/cost ratio, most small to medium-sized PC clusters are still built on the commodity network, Ethernet. To enhance performance on commonly used Gigabit Ethernet networks, link aggregation or bonding technology is used. Currently, the Linux kernel is equipped with a software solution named Linux channel bonding (LCB), which is based on IEEE 802.3ad Link Aggregation. However, standard LCB mismatches the commonly used TCP protocol, which implies several problems, including large latency and unstable bandwidth improvement. Fault tolerance is also supported, but its usability is not sufficient. We have developed a new implementation similar to LCB, named RI2N/DRV (redundant interconnection with inexpensive network with driver), for use on Gigabit Ethernet, with a complete software stack that is highly compatible with the TCP protocol. Our algorithm suppresses unnecessary ACK packets and retransmissions even under imbalanced network traffic and link failures across multiple links. It provides both high-bandwidth and fault-tolerant communication on multi-link Gigabit Ethernet. We confirmed that this system improves the performance and reliability of the network, and that it can be applied to ordinary UNIX services such as NFS without any modification of other modules.
Cited by: 5
OpenMP-centric performance analysis of hybrid applications
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663767
K. Fürlinger, S. Moore
Several performance analysis tools support hybrid applications. Most originated as MPI profiling or tracing tools, with OpenMP capabilities added later to extend their performance analysis to the hybrid parallelization case. In this paper we describe our experience with the opposite path to supporting both programming paradigms. Our starting point is a profiling tool for OpenMP called ompP, which we extended to handle MPI-related data. The measured data and the method of presentation follow our focus on the OpenMP side of the performance optimization cycle. For example, the existing overhead classification scheme of ompP was extended to cover time spent in MPI calls as a new type of overhead.
Cited by: 3
Workflows for performance evaluation and tuning
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663758
J. Tilson, Mark S. C. Reed, R. Fowler
We report our experiences using high-throughput techniques to run large sets of performance experiments on collections of grid-accessible parallel computer systems, for the purpose of deploying optimally compiled and configured scientific applications. In these environments, the set of variable parameters (compiler, link, and runtime flags; application and library options; partition size) can be very large, so running the performance ensembles is labor-intensive, tedious, and prone to errors. Automating this process improves productivity, reduces barriers to deploying and maintaining multi-platform codes, and facilitates tracking application and system performance over time. We describe the design and implementation of our system for running performance ensembles, and we use two case studies as the basis for evaluating the long-term potential of this approach. The architecture of a prototype benchmarking system is presented along with results on the efficacy of the workflow approach.
Cited by: 4
Gather-arrange-scatter: Node-level request reordering for parallel file systems on multi-core clusters
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663792
Kazuki Ohta, Hiroya Matsuba, Y. Ishikawa
Multiple processors and multi-core CPUs are now common, and the number of processes running concurrently in a cluster is increasing. Each process issues contiguous I/O requests individually, but these can be interleaved with the requests of other processes if all the processes enter the I/O phase together, so the I/O nodes see the requests as non-contiguous. This increases disk seek time and causes performance degradation. To overcome this problem, a node-level request reordering architecture, called gather-arrange-scatter (GAS), is proposed. In GAS, the I/O requests within the same node are gathered and buffered locally. They are then arranged and combined to reduce the I/O cost at the I/O nodes, and finally scattered to the remote I/O nodes in parallel. A prototype is implemented and evaluated using the BTIO benchmark. The system eliminates up to 84.3% of the lseek() calls and reduces the number of requests at the I/O nodes by up to 93.6%. This results in up to a 12.7% performance improvement over the non-arranged case.
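The "arrange" step can be illustrated with a minimal sketch (the `(offset, size)` request representation and the function name are hypothetical simplifications of the paper's architecture): sort the node's buffered requests by offset and coalesce adjacent ranges, so the I/O nodes receive fewer, more contiguous accesses.

```python
def arrange(requests):
    """Sort (offset, size) requests by offset and merge ranges that are
    exactly adjacent, reducing the request count seen by I/O nodes."""
    merged = []
    for off, size in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == off:
            # This request continues the previous range: extend it.
            merged[-1] = (merged[-1][0], merged[-1][1] + size)
        else:
            merged.append((off, size))
    return merged

# Interleaved requests from two processes on the same node.
reqs = [(0, 4), (8, 4), (4, 4), (12, 4), (100, 4)]
print(arrange(reqs))  # [(0, 16), (100, 4)]
```

Five scattered requests collapse into two, which is the effect behind the reported reductions in `lseek()` calls and request counts.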
Cited by: 3
An optimized Dynamic Load Balancing method for parallel 3-D mesh refinement for finite element electromagnetics with Tetrahedra
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663804
D. Ren, D. Giannacopoulos, R. Suda
A new Dynamic Load Balancing (DLB) method for automatic performance tuning of parallel, adaptive, 3-D mesh refinement is developed based on a study of the characteristics of the Finite Element Method (FEM) for electromagnetics with tetrahedra. On top of existing DLB algorithms, the new design optimizes the task-pool location of each processing element (PE) and the initial data assignments in a multiprocessor parallel architecture. We investigate the method by applying the algorithm in implementations of parallel 3-D Hierarchical Tetrahedra and Octahedra (HTO) mesh refinement. By comparing benchmark results of the new method with the performance of two existing DLB algorithms running the same HTO example geometric mesh-refinement model on the same parallel architecture, the benefits of the new method for achieving high-performance parallel mesh refinement are demonstrated.
Cited by: 2
Parallel multistage preconditioners by Hierarchical Interface Decomposition on "T2K Open Super Computer (Todai Combined Cluster)" with Hybrid parallel programming models
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663785
K. Nakajima
In this work, parallel preconditioning methods based on "Hierarchical Interface Decomposition (HID)" and hybrid parallel programming models were applied to finite-element simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied for parallelism through OpenMP. The developed code has been tested on the "T2K Open Super Computer (Todai Combined Cluster)" using up to 512 cores. Preconditioners based on HID provide scalable performance and robustness in comparison to conventional localized block Jacobi preconditioners. The performance of the hybrid 4x4 parallel programming model is competitive with that of flat MPI.
Cited by: 1
Using cluster computing to support automatic and dynamic database clustering 利用集群计算支持自动和动态的数据库集群
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663800
Sylvain Guinepain, L. Gruenwald
Query response time is the number one metric when it comes to database performance. Because of data proliferation, efficient access methods and data storage techniques have become increasingly critical to maintaining an acceptable query response time. Since retrieving data from disk is several orders of magnitude slower than retrieving it from memory, it is easy to see the direct correlation between query response time and the number of disk I/Os. One of the common ways to reduce disk I/Os and therefore improve query response time is database clustering, which is a process that partitions the database vertically (attribute clustering) and/or horizontally (record clustering). A clustering is optimized for a given set of queries. However, in dynamic systems the queries change with time, the clustering in place becomes obsolete, and the database needs to be re-clustered dynamically. This paper presents an efficient algorithm for attribute clustering that dynamically and automatically generates attribute clusters based on closed item sets mined from the attribute sets found in the queries running against the database. The paper then discusses how this algorithm can be implemented using the cluster computing paradigm to reduce query response time even further through parallelism and data redundancy.
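The closed-itemset idea behind the abstract can be sketched in a few lines: attributes that always co-occur in queries collapse into one closed set, and those closed sets are natural candidates for attribute clusters. A toy, brute-force sketch; the query workload, attribute names, and `min_support` threshold are all invented for illustration, and a real system would use an efficient miner (e.g., a CHARM-style algorithm) rather than enumerating subsets:

```python
from itertools import combinations

# Hypothetical query workload: each entry is the set of attributes one query touches.
queries = [
    {"id", "name", "salary"},
    {"id", "name"},
    {"id", "salary"},
    {"addr", "city"},
    {"addr", "city"},
]

def support(itemset, workload):
    """Number of queries whose attribute set contains this itemset."""
    return sum(1 for q in workload if itemset <= q)

def closed_itemsets(workload, min_support=2):
    """Brute-force closed frequent itemset mining (fine for toy inputs only)."""
    items = set().union(*workload)
    candidates = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(sorted(items), r):
            s = support(set(combo), workload)
            if s >= min_support:
                candidates[frozenset(combo)] = s
    # An itemset is closed iff no proper superset has the same support.
    return {
        iset: s for iset, s in candidates.items()
        if not any(iset < other and s == candidates[other] for other in candidates)
    }

# Each closed set is a candidate attribute cluster (vertical partition).
for iset, s in sorted(closed_itemsets(queries).items(), key=lambda kv: -kv[1]):
    print(sorted(iset), s)
```

On this workload, `{addr, city}` is closed while `{addr}` alone is not, since `city` appears in every query that touches `addr` — exactly the co-access pattern that makes storing the two attributes together cut disk I/Os.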
{"title":"Using cluster computing to support automatic and dynamic database clustering","authors":"Sylvain Guinepain, L. Gruenwald","doi":"10.1109/CLUSTR.2008.4663800","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663800","url":null,"abstract":"Query response time is the number one metrics when it comes to database performance. Because of data proliferation, efficient access methods and data storage techniques have become increasingly critical to maintain an acceptable query response time. Retrieving data from disk is several orders of magnitude slower than retrieving it from memory, it is easy to see the direct correlation between query response time and the number of disk I/Os. One of the common ways to reduce disk I/Os and therefore improve query response time is database clustering, which is a process that partitions the database vertically (attribute clustering) and/or horizontally (record clustering). A clustering is optimized for a given set of queries. However in dynamic systems the queries change with time, the clustering in place becomes obsolete, and the database needs to be re-clustered dynamically. This paper presents an efficient algorithm for attribute clustering that dynamically and automatically generates attribute clusters based on closed item sets mined from the attributes sets found in the queries running against the database. 
The paper then discusses how this algorithm can be implemented using the cluster computing paradigm to reduce query response time even further through parallelism and data redundancy.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130659775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 14
Journal
2008 IEEE International Conference on Cluster Computing