
Latest publications from the 2008 IEEE International Conference on Cluster Computing

Implications of non-constant clock drifts for the timestamps of concurrent events
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663756
Daniel Becker, R. Rabenseifner, F. Wolf
To support the development of efficient parallel codes on cluster systems, event tracing is a widely used technique with a broad spectrum of applications ranging from performance analysis, performance prediction, and modeling to debugging. Usually, events are recorded along with the time of their occurrence to measure the temporal distance between them and/or to establish a total event ordering. Obviously, measuring the time between concurrent events requires a global clock, which, however, is often not available on clusters. Assuming that the potentially different drifts of local clocks remain constant over time, linear offset interpolation can be applied postmortem to map local onto global timestamps. In this study, we investigate the robustness of the above assumption using different timers and show that the error of timestamps derived in this way can easily lead to a misrepresentation of the logical event order imposed by the semantics of the underlying communication substrate. We conclude that linear offset interpolation alone may be insufficient for many applications of event tracing and discuss further options.
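The linear offset interpolation described above can be sketched in a few lines: two offset measurements taken at synchronization points define a linear mapping from local to global time, under the constant-drift assumption the paper shows to be fragile. Names and sync-point values below are illustrative, not from the paper.

```python
def make_linear_mapper(local_1, global_1, local_2, global_2):
    """Build a local->global timestamp mapping from two sync-point
    measurements, assuming the clock drift is constant in between."""
    rate = (global_2 - global_1) / (local_2 - local_1)  # estimated drift rate
    return lambda t_local: global_1 + rate * (t_local - local_1)

# Hypothetical sync points: a local clock that runs 0.1% slow
# relative to the master clock.
to_global = make_linear_mapper(0.0, 100.0, 1000.0, 1099.0)
```

If the drift is in fact non-constant between the two sync points, every timestamp produced by such a mapper carries an error, which is exactly what can invert the apparent order of closely spaced concurrent events.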
Citations: 16
Active CoordinaTion (ACT) - toward effectively managing virtualized multicore clouds
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663752
M. Kesavan, A. Ranadive, Ada Gavrilovska, K. Schwan
A key benefit of utility data centers and cloud computing infrastructure is the level of consolidation they can offer to arbitrary guest applications, and the substantial savings in operational costs and resources that can be derived in the process. However, significant challenges remain before it becomes possible to manage virtualized systems effectively and at low cost, particularly in the face of the increasing complexity of individual many-core platforms, and given the dynamic behaviors and resource requirements exhibited by cloud guest VMs. This paper describes the active coordination (ACT) approach, aimed at a specific issue in the management domain: management actions must (1) typically touch upon multiple resources in order to be effective, and (2) be continuously refined in order to deal with the dynamism in platform resource loads. ACT relies on the notion of class-of-service, associated with (sets of) guest VMs, based on which it maps VMs onto platform units, the latter encapsulating sets of platform resources of different types. Using these abstractions, ACT can perform active management in multiple ways, including a VM-specific approach and a black-box approach that relies on continuous monitoring of the guest VMs' runtime behavior and on an adaptive resource allocation algorithm, termed the Multiplicative Increase, Subtractive Decrease Algorithm with Wiggle Room. In addition, ACT permits explicit external events to trigger VM- or application-specific resource allocations, e.g., leveraging emerging standards such as WSDM. The experimental analysis of the ACT prototype, built for Xen-based platforms, uses industry-standard benchmarks, including RUBiS, Hadoop, and SPEC. The experiments demonstrate ACT's ability to efficiently manage the aggregate platform resources according to the guest VMs' relative importance (class-of-service), for both the black-box and the VM-specific approach.
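The adaptive allocation rule named in the abstract (Multiplicative Increase, Subtractive Decrease with Wiggle Room) can be illustrated with a toy controller. All parameter names and constants below are assumptions for illustration, not ACT's actual values.

```python
def adjust_share(current, demand, cap, grow=2.0, shrink=0.1, wiggle=0.05):
    """One adjustment step of a toy MISD-with-wiggle-room controller:
    grow the resource share multiplicatively when observed demand exceeds
    it, shrink it subtractively when demand falls below it, and hold
    steady inside a 'wiggle room' dead zone that damps oscillation."""
    if demand > current * (1 + wiggle):
        return min(current * grow, cap)    # multiplicative increase
    if demand < current * (1 - wiggle):
        return max(current - shrink, 0.0)  # subtractive decrease
    return current                          # within wiggle room: no change
```

The asymmetry is deliberate: a VM whose demand spikes gets capacity back quickly, while reclaiming capacity from it happens in small steps so transient dips do not starve it.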
Citations: 26
Variable-grain and dynamic work generation for Minimal Unique Itemset mining
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663753
Paraskevas Yiapanis, D. Haglin, Anna M. Manning, K. Mayes, J. Keane
SUDA2 is a recursive search algorithm for minimal unique itemset detection. Such sets of items are formed via combinations of non-obvious attributes, enabling individual record identification. The nature of SUDA2 allows work to be divided into non-overlapping tasks, enabling parallel execution. Earlier work developed a parallel implementation of SUDA2 on an SMP cluster, which was found to be several orders of magnitude faster than sequential SUDA2. However, if fixed-granularity parallel tasks are scheduled naively in the order of their generation, the system load tends to be imbalanced, with little work at the beginning and end of the search. This paper investigates the effectiveness of variable-grained and dynamic work generation strategies for parallel SUDA2. These methods restrict the number of sub-tasks to be generated, based on the criterion of probable work size. The further we descend in the search recursion tree, the smaller the tasks become; thus we select only the largest tasks at each level of recursion as suitable for scheduling. The revised algorithm runs approximately twice as fast as the existing parallel SUDA2 at finer levels of granularity when variable-grained work generation is applied. The dynamic method, performing level-wise task selection based on size, outperforms the other techniques investigated.
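The level-wise selection policy (keep only the largest tasks at each recursion level for parallel scheduling, run the rest inline) can be sketched as follows; the `(name, size)` task representation and the `top_k` parameter are illustrative, not SUDA2's actual data structures.

```python
def select_for_scheduling(tasks_by_level, top_k):
    """At each recursion level, pick the top_k largest tasks (by
    estimated work size) for parallel scheduling; the remainder run
    inline with their parent, since tasks shrink as the recursion
    descends and small ones are not worth dispatching."""
    scheduled, inline = [], []
    for level in sorted(tasks_by_level):
        ranked = sorted(tasks_by_level[level], key=lambda t: t[1], reverse=True)
        scheduled += ranked[:top_k]
        inline += ranked[top_k:]
    return scheduled, inline
```

Restricting sub-task generation this way keeps the task queue populated with coarse work throughout the search, which is what smooths out the load imbalance at its start and end.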
Citations: 6
SPRAT: Runtime processor selection for energy-aware computing
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663799
H. Takizawa, Katsuto Sato, Hiroaki Kobayashi
A commodity personal computer (PC) can be seen as a hybrid computing system equipped with two different kinds of processors, i.e., a CPU and a graphics processing unit (GPU). Since the superiority of GPUs in performance and power efficiency strongly depends on the system configuration and the data size determined at runtime, a programmer cannot always know which processor should be used to execute a certain kernel. Therefore, this paper presents a runtime environment that dynamically selects an appropriate processor so as to improve energy efficiency. The evaluation results clearly indicate that runtime processor selection when executing each kernel on given data streams is promising for energy-aware computing on a hybrid computing system.
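At its core, a runtime selection of this kind compares cost estimates for the same kernel on each processor at the data size actually seen at run time. The cost models below are hypothetical placeholders to show the shape of the decision, not SPRAT's actual models.

```python
def select_processor(n_elements, cpu_cost, gpu_cost):
    """Return the processor with the lower estimated cost (e.g. energy)
    for a kernel over n_elements; cost models are supplied by the caller."""
    return "gpu" if gpu_cost(n_elements) < cpu_cost(n_elements) else "cpu"

# Toy models: the GPU pays a fixed data-transfer overhead but has a
# lower per-element cost, so it wins only for large enough inputs.
cpu_model = lambda n: 1.0 * n
gpu_model = lambda n: 500.0 + 0.1 * n
```

Because the break-even point depends on constants only known for a concrete system configuration, the decision cannot be made reliably at programming time, which is the abstract's motivation for deferring it to the runtime.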
Citations: 42
Redistribution aware two-step scheduling for mixed-parallel applications
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663755
S. Hunold, T. Rauber, F. Suter
Applications arising in many scientific fields exhibit both data and task parallelism that have to be exploited efficiently. A classic approach is to structure such applications by a task graph whose nodes represent parallel computations. Scheduling such mixed-parallel applications is challenging even on a single homogeneous platform, such as a cluster. Most mixed-parallel application scheduling algorithms rely on two decoupled steps: allocation and mapping. This separation can induce unnecessary or costly data redistributions that have an impact on overall performance. This is particularly true for data-intensive applications. In this paper, we propose an original approach in which the allocations determined in the first step can be adapted during the second step in order to minimize the impact of these data redistributions. Two redistribution-aware mapping strategies are detailed, and their impact on the schedule length is studied through a comparison with an efficient two-step algorithm over a broad range of experimental scenarios.
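The core idea, weighing a candidate allocation's compute time against the cost of redistributing its predecessor's data, can be captured in a toy cost model; all names and numbers below are illustrative, not the paper's actual model.

```python
def mapping_cost(alloc, pred_alloc, work, move_cost):
    """Toy model: parallel compute time (work split across processors)
    plus a redistribution penalty for each processor in `alloc` that did
    not already hold the predecessor task's data."""
    compute = work / len(alloc)
    moved = len(set(alloc) - set(pred_alloc))
    return compute + moved * move_cost

# With cheap data movement the wider allocation wins; with expensive
# movement the scheduler prefers reusing the predecessor's processors.
pred = [0, 1]
candidates = [[0, 1], [2, 3, 4]]
best_cheap = min(candidates, key=lambda a: mapping_cost(a, pred, 1200.0, 10.0))
best_costly = min(candidates, key=lambda a: mapping_cost(a, pred, 1200.0, 150.0))
```

This is exactly the trade-off a redistribution-oblivious two-step scheduler misses: it fixes the allocation first and may then be forced into the costly data movement.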
Citations: 16
Multi-core aware optimization for MPI collectives
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663789
Bibo Tu, Ming Zou, Jianfeng Zhan, Xiaofang Zhao, Jianping Fan
MPI collective operations on multi-core clusters should be multi-core aware. In this paper, collective algorithms with a hierarchical virtual topology address the performance differences among communication levels on multi-core clusters, distinguishing intra-node from inter-node communication. Furthermore, selecting suitable segment sizes for intra-node collective communication can exploit the cache hierarchy of multi-core processors. Building on existing collective algorithms in MPICH2, these two techniques form a portable optimization methodology over MPICH2 for collective operations on multi-core clusters. Following this methodology, a multi-core aware broadcast algorithm has been implemented and evaluated as a case study. The results of the performance evaluation show that the multi-core aware optimization methodology over MPICH2 is efficient.
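The benefit of a hierarchical, multi-core-aware broadcast can be seen just by counting sends: only one message per remote node crosses the network, while the fan-out to local ranks stays in shared memory. The sketch below counts sends for such a two-phase broadcast; it is an illustration of the idea, not MPICH2 code.

```python
def bcast_send_counts(nodes):
    """Count sends for a two-phase broadcast: the root sends one message
    to the leader of every other node (inter-node, over the network),
    then each leader forwards to its remaining local ranks (intra-node,
    e.g. via shared memory). `nodes` maps a node name to its ranks."""
    inter_node = len(nodes) - 1  # one network send per remote node
    intra_node = sum(len(ranks) - 1 for ranks in nodes.values())
    return inter_node, intra_node
```

For two nodes of four cores each, a topology-oblivious broadcast may push most of its seven sends across the network, whereas the hierarchical scheme needs exactly one network send, with the other six staying node-local.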
Citations: 16
Impact of topology and link aggregation on a PC cluster with Ethernet
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663782
Takafumi Watanabe, M. Nakao, T. Hiroyasu, Tomohiro Otsuka, M. Koibuchi
In addition to its use in local area networks, Ethernet has been used for connecting hosts in the area of high-performance computing. Here, we investigated the impact of topology and link aggregation on a large-scale PC cluster with Ethernet. An Ethernet topology that allows loops, together with its routing, can be implemented with the VLAN routing method without creating broadcast storms. To simplify the system configuration without modifying system software, the VLAN tag is added to a frame at the switches in our implementation of topologies. Each host creates VLAN interfaces that have different local network addresses on a physical interface, so that a switch learns the MAC addresses of the hosts in a PC cluster by broadcast. Evaluation results showed that the performance characteristics of an eight-switch network are comparable to those of an ideal one-switch (full crossbar) network in the execution of the high-performance LINPACK benchmark (HPL) on a 225-host PC cluster. On the other hand, evaluation results using the NAS parallel benchmarks indicated that topologies achieved by the proposed methodology showed performance improvements of up to about 650% as compared to a simple tree topology. These results indicate that topology and link aggregation have a marked impact, and that commodity switches can be used instead of expensive, highly functional switches.
Citations: 14
Jetter: a multi-pattern parallel I/O benchmark
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663808
Liqiang Cao, Hongbing Luo, Baoyin Zhang
This paper proposes a new multi-pattern parallel I/O benchmark called Jetter, which evaluates parallel I/O throughput with either the contiguous or the non-contiguous I/O pattern, in either the share-one-file model or the file-per-process model, through either the POSIX interface or the MPI-I/O interface. Jetter helps end users understand how access patterns govern performance, and helps them develop efficient applications on a platform. We have evaluated parallel I/O bandwidth on a 32-CPU shared-memory computer with Jetter. The results show that the I/O pattern determines throughput: optimizing the I/O model, interface, etc. within a pattern can improve bandwidth by a factor of 2 or 3.
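In the share-one-file model, the difference between the contiguous and non-contiguous patterns comes down to how per-rank file offsets are laid out. A sketch of the two layouts (parameter names are illustrative, not Jetter's):

```python
def shared_file_offsets(rank, nprocs, block, count, pattern):
    """File offsets written by one rank into a shared file. 'contiguous'
    gives each rank one consecutive region of count blocks; 'strided'
    interleaves block-sized pieces from all ranks (the non-contiguous
    pattern), so consecutive blocks of one rank are nprocs blocks apart."""
    if pattern == "contiguous":
        base = rank * block * count
        return [base + i * block for i in range(count)]
    if pattern == "strided":
        return [(i * nprocs + rank) * block for i in range(count)]
    raise ValueError(f"unknown pattern: {pattern}")
```

The strided layout forces each rank to issue many small, scattered requests, which is why the choice of pattern alone can dominate the measured throughput.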
Citations: 0
Enhancing write performance of a shared-disk cluster filesystem through a fine-grained locking strategy
Pub Date: 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663809
Paulo Afonso Lopes, P. Medeiros
We present part of our recent work on performance enhancement of cluster file systems using shared disks over a SAN. This work is built around the proposal of pCFS, a file system specifically targeting those environments. In previous work, we presented the objectives and design principles of pCFS and a proof-of-concept implementation, carried out by modifying Red Hat's GFS, showing significant improvements in operations over files shared among processes running in different nodes. pCFS differs from GFS in two main aspects: its use of cooperative caching and a finer grain of locking. The first aspect, which uses the LAN to enhance performance in write-sharing situations, was described elsewhere; we now introduce a complementary strategy - locking file regions instead of the whole file - which enables us to use the SAN while delivering a high level of performance in those same write-sharing situations. pCFS may apply inter-node locks to regions, allowing processes to operate in parallel with a minimum of coherency overhead among nodes; a process cannot access outside its region(s), and, when a writer unlocks a region, others can then lock it and see the modified data immediately. Through a set of experiments in which a file is shared between processes running in different nodes, we show that the described approach yields a gain of at least an order of magnitude over plain GFS.
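The fine-grained locking idea - lock byte ranges rather than whole files, so that writers on different nodes can proceed in disjoint regions - can be sketched with a minimal in-memory lock table. This is an illustration of the concept, not the pCFS implementation.

```python
class RegionLockTable:
    """Byte-range locks over one file: a lock on [start, end) conflicts
    only with overlapping ranges held by a different owner, so writers
    on different nodes can work on disjoint regions in parallel instead
    of serializing on a single whole-file lock."""

    def __init__(self):
        self._held = []  # list of (start, end, owner) tuples

    def try_lock(self, start, end, owner):
        for s, e, o in self._held:
            if start < e and s < end and o != owner:
                return False  # overlaps a region held by another owner
        self._held.append((start, end, owner))
        return True

    def unlock(self, start, end, owner):
        self._held.remove((start, end, owner))
```

Releasing a region is also the natural point to make the writer's modified data visible to the next locker, which matches the visibility rule the abstract describes.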
Citations: 2
Combining Virtual Machine migration with process migration for HPC on multi-clusters and Grids
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663759
Tal Maoz, A. Barak, Lior Amar
The renewed interest in virtualization gives rise to new opportunities for running high performance computing (HPC) applications on clusters and grids. These include the ability to create a uniform (virtual) run-time environment on top of a multitude of hardware and software platforms, and the possibility for dynamic resource allocation towards the improvement of process performance, e.g., by virtual machine (VM) migration as a means for load-balancing. This paper deals with issues related to running HPC applications on multi-clusters and grids using VMware, a virtualization package running on Windows, Linux and OS X. The paper presents the "Jobrun" system for transparent, on-demand VM launching upon job submission, and its integration with the MOSIX cluster and grid management system. We present a novel approach to job migration, combining VM migration with process migration using Jobrun, by which it is possible to migrate groups of processes and parallel jobs among different clusters in a multi-cluster or in a grid. We use four real HPC applications to evaluate the overheads of VMware (both on Linux and Windows), the MOSIX cluster extensions and their combination, and present detailed measurements of the performance of Jobrun.
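The on-demand VM launching described above can be pictured as a placement decision made at job-submission time: run the job natively when the target node already provides the required run-time environment, and wrap it in a VM otherwise. The toy sketch below is purely illustrative - the names `Job`, `Node`, and `placement` are hypothetical and are not Jobrun's real interface.

```python
# Toy sketch (hypothetical names, not Jobrun's API): decide at submission
# time whether a job needs an on-demand VM because the target node's
# platform differs from the job's required run-time environment.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    needs_os: str   # run-time environment the job's binaries expect

@dataclass
class Node:
    host_os: str    # platform the node actually runs

def placement(job: Job, node: Node) -> str:
    """Run natively on a matching platform; otherwise launch a VM."""
    if job.needs_os == node.host_os:
        return "native"
    return "vm"     # transparent, on-demand VM launch

# A Windows job submitted to a Linux node gets a VM; a Linux job runs natively.
print(placement(Job("sim", "windows"), Node("linux")))
```

Migration then works at either granularity: the whole VM can move between nodes, or individual processes can move within and between the (virtual) MOSIX clusters.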
Citations: 36
Journal
2008 IEEE International Conference on Cluster Computing