
Latest publications from the 2011 IEEE International Parallel & Distributed Processing Symposium

Power and Performance Management in Priority-Type Cluster Computing Systems
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.13
Kaiqi Xiong
Cluster computing not only improves performance but also increases power consumption over that of a single computer. It is a challenge to increase the performance of a cluster computing system and reduce its power consumption simultaneously. In this paper, we consider a collection of cluster computing resources owned by a service provider to host an enterprise application for multiple classes of business customers, whose requests are distinguished by different request characteristics and service requirements. We first develop a method for computing the average end-to-end delay and the average energy consumption for multiple customer classes in such an application. Then, we present approaches for optimizing the average end-to-end delay subject to a constraint on average energy consumption, and for optimizing the average energy consumption subject to constraints on the average end-to-end delay, for all classes and for each class of customer requests, respectively. Moreover, a service provider processes customer service requests according to a service level agreement (SLA), a contract agreed upon between a customer and the service provider. It has become important and commonplace to prioritize multiple customer services in favor of customers who are willing to pay higher fees. We propose an approach for minimizing the total cost of the cluster computing resources allocated by the service provider to ensure service guarantees for multiple priority classes of customers. Our simulations demonstrate that the proposed approaches are efficient and accurate for power management and performance guarantees in priority-type cluster computing systems.
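As a rough illustration of the two optimization directions described in this abstract, one plausible formulation is sketched below; the symbols (per-class arrival rates $\lambda_k$, per-class delays $D_k$, energy budget $E_{\max}$, and delay bounds $D_k^{\max}$) are illustrative assumptions, not notation taken from the paper.

$$
\min \;\; \bar{D}=\sum_{k}\frac{\lambda_k}{\sum_j \lambda_j}\,D_k
\quad\text{s.t.}\quad \bar{E}\le E_{\max}
\qquad\qquad
\min \;\; \bar{E}
\quad\text{s.t.}\quad D_k\le D_k^{\max}\ \ \forall k
$$

The first problem minimizes the weighted average end-to-end delay under an energy budget; the second minimizes average energy consumption under per-class delay guarantees, matching the two directions described in the abstract.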
Citations: 1
Fast Community Detection Algorithm with GPUs and Multicore Architectures
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.61
Jyothish Soman, A. Narang
In this paper, we present the design of a novel scalable parallel algorithm for community detection optimized for multi-core and GPU architectures. Our algorithm is based on label propagation, which works solely on local information, thus giving it a scalability advantage over conventional approaches. We also show that weighted label propagation can overcome typical quality issues in communities detected with label propagation. Experimental results on well-known massive-scale graphs such as Wikipedia (100M edges), as well as on R-MAT graphs with 10M-40M edges, demonstrate the superior performance and scalability of our algorithm compared to well-known approaches for community detection. On the hep-th graph (352K edges) and the Wikipedia graph (100M edges), using a Power 6 architecture with 32 cores, our algorithm achieves one to two orders of magnitude better performance compared to the best known prior results on parallel architectures with a similar number of CPUs. Further, our GPGPU-based algorithm achieves an 8x improvement over the Power 6 performance on the 40M-edge R-MAT graph. Alongside, we achieve high quality (modularity) of the detected communities, with experimental evidence from well-known graphs such as the Zachary karate club, Dolphin network, and Football club, where we achieve modularity close to the best known alternatives. To the best of our knowledge, these are the best known results for community detection on massive graphs (100M edges) in terms of performance as well as the quality-versus-performance trade-off. This is also a unique work on community detection on GPGPUs with scalable performance.
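To make the label-propagation idea concrete, here is a minimal sequential sketch in Python of weighted label propagation on a small graph. The adjacency representation, tie-breaking rule, and iteration cap are illustrative assumptions; this is not the parallel GPU/multicore algorithm from the paper.

```python
import random
from collections import defaultdict

def weighted_label_propagation(adj, max_iters=20, seed=0):
    """Toy sequential sketch of weighted label propagation.

    adj: dict mapping node -> dict of neighbor -> edge weight.
    Each node adopts the label with the largest total incident weight
    among its neighbors; ties are broken at random.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # every node starts in its own community
    nodes = list(adj)
    for _ in range(max_iters):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            score = defaultdict(float)
            for u, w in adj[v].items():
                score[labels[u]] += w     # weighted vote from each neighbor
            best = max(score.values())
            candidates = [l for l, s in score.items() if s == best]
            new_label = rng.choice(candidates)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:                   # converged: no node changed its label
            break
    return labels

# Usage: two triangles joined by one weak edge collapse into two communities.
if __name__ == "__main__":
    g = {
        0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0, 3: 0.1},
        3: {2: 0.1, 4: 1.0, 5: 1.0}, 4: {3: 1.0, 5: 1.0}, 5: {3: 1.0, 4: 1.0},
    }
    print(weighted_label_propagation(g))
```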
Citations: 73
Iso-Energy-Efficiency: An Approach to Power-Constrained Parallel Computation
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.22
S. Song, Chun-Yi Su, Rong Ge, Abhinav Vishnu, K. Cameron
Future large scale high performance supercomputer systems require high energy efficiency to achieve exaflops computational power and beyond. Despite the need to understand energy efficiency in high-performance systems, there are few techniques to evaluate energy efficiency at scale. In this paper, we propose a system-level iso-energy-efficiency model to analyze, evaluate and predict energy-performance of data intensive parallel applications with various execution patterns running on large scale power-aware clusters. Our analytical model can help users explore the effects of machine and application dependent characteristics on system energy efficiency and isolate efficient ways to scale system parameters (e.g. processor count, CPU power/frequency, workload size and network bandwidth) to balance energy use and performance. We derive our iso-energy-efficiency model and apply it to the NAS Parallel Benchmarks on two power-aware clusters. Our results indicate that the model accurately predicts total system energy consumption within 5% error on average for parallel applications with various execution and communication patterns. We demonstrate effective use of the model for various application contexts and in scalability decision-making.
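For intuition, "iso-energy-efficiency" can be read by analogy with classical iso-efficiency scaling. The simplified reading below defines energy efficiency as useful work per joule and asks how the workload must grow with the processor count to hold it constant; this decomposition is an illustrative assumption, not the paper's actual model.

$$
EE(W,p)=\frac{W}{E(W,p)},\qquad
E(W,p)\approx\sum_{i=1}^{p}\Big(P_i^{\text{busy}}\,t_i^{\text{busy}}+P_i^{\text{idle}}\,t_i^{\text{idle}}\Big),\qquad
EE(W',p')\stackrel{!}{=}EE(W,p)
$$

Here $W$ is the workload, $p$ the processor count, and $P_i$, $t_i$ the per-processor power and time in each state; the iso condition on the right asks for the workload scaling $W'$ that keeps efficiency constant when moving to $p'$ processors.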
Citations: 54
Architectural Constraints to Attain 1 Exaflop/s for Three Scientific Application Classes
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.18
A. Bhatele, Pritish Jetley, Hormozd Gahvari, Lukasz Wesolowski, W. Gropp, L. Kalé
The first Teraflop/s computer, the ASCI Red, became operational in 1997, and it took more than 11 years for a Petaflop/s performance machine, the IBM Roadrunner, to appear on the Top500 list. Efforts have begun to study the hardware and software challenges for building an exascale machine. It is important to understand and meet these challenges in order to attain Exaflop/s performance. This paper presents a feasibility study of three important application classes to formulate the constraints that these classes will impose on the machine architecture for achieving a sustained performance of 1 Exaflop/s. The application classes being considered in this paper are -- classical molecular dynamics, cosmological simulations and unstructured grid computations (finite element solvers). We analyze the problem sizes required for representative algorithms in each class to achieve 1 Exaflop/s and the hardware requirements in terms of the network and memory. Based on the analysis for achieving an Exaflop/s, we also discuss the performance of these algorithms for much smaller problem sizes.
Citations: 28
Efficient Parallel Scheduling of Malleable Tasks
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.110
P. Sanders, Jochen Speck
We give an $O(n + \min\{n,m\}\log m)$ work algorithm for scheduling $n$ tasks with a flexible amount of parallelism on $m$ processors, provided the speedup functions of the tasks are concave. We give efficient parallelizations of the algorithm that run in polylogarithmic time. Previous algorithms were sequential and required quadratic work. This is in some sense a best-possible result, since the problem is NP-hard for more general speedup functions.
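For context, the underlying malleable-task scheduling problem can be posed roughly as follows; the notation (work $w_j$, time-varying allotment $p_j(t)$, concave speedup $s_j$) is a standard formulation used here for illustration and is an assumption rather than the paper's exact statement.

$$
\text{minimize } T\quad\text{s.t.}\quad
\int_0^{T} s_j\big(p_j(t)\big)\,dt \ge w_j\ \ \forall j,\qquad
\sum_{j} p_j(t)\le m\ \ \forall t,\qquad p_j(t)\ge 0,
$$

where each task $j$ must receive $w_j$ units of work, the speedup $s_j(p)$ achieved with $p$ processors is concave, and at most $m$ processors are in use at any time.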
Citations: 14
The Evaluation of an Effective Out-of-Core Run-Time System in the Context of Parallel Mesh Generation
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.25
A. Kot, Andrey N. Chernikov, N. Chrisochoides
We present an out-of-core run-time system that supports effective parallel computation of large irregular and adaptive problems, in particular parallel unstructured mesh generation (PUMG). PUMG is a highly challenging application due to intensive memory accesses, unpredictable communication patterns, and variable and irregular data dependencies reflecting the unstructured spatial connectivity of mesh elements. Our runtime system makes it possible to transform the footprint of parallel applications from wide and shallow into narrow and deep by extending memory utilization to the out-of-core level. It simplifies and streamlines the development of otherwise highly time-consuming out-of-core applications, as well as the conversion of existing applications. It utilizes the disk, network, and memory hierarchy to achieve high utilization of computing resources without sacrificing PUMG performance. The runtime system combines different programming paradigms: multi-threading within the nodes using an industrial-strength software framework, one-sided active messages among the nodes, and an out-of-core subsystem for managing large datasets. We performed an evaluation on traditional parallel platforms to stress test all layers of the run-time system using three different PUMG methods with significantly varying communication and synchronization patterns. We demonstrated high overlap in computation, communication, and disk I/O, which results in good performance when computing large out-of-core problems. The runtime system adds very small overhead (up to 18% on most configurations) when computing in-core, which means performance is not compromised.
Citations: 8
A Novel Power Management for CMP Systems in Data-Intensive Environment
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.19
Pengju Shang, Jun Wang
The emerging data-intensive applications of today are composed of non-uniform CPU- and I/O-intensive workloads, thus imposing a requirement to consider both CPU and I/O effects in power management strategies. Scaling down the processor's frequency based only on its busy/idle ratio cannot fully exploit opportunities for saving power. Our experiments show that besides the busy and idle states, each processor may also have I/O-wait phases, waiting for I/O operations to complete. During this period, the completion time is decided by the I/O subsystem rather than the CPU; thus, scaling the processor to a lower frequency will not affect performance but will save more power. In addition, the CPU's reaction to I/O operations may be significantly affected by several factors, such as the I/O type (sync or unsync) and instruction/job-level parallelism, so it cannot be accurately modeled via physical laws as mechanical or chemical systems can. In this paper, we propose a novel power management scheme called MAR (modeless, adaptive, rule-based) for multiprocessor systems to minimize CPU power consumption under performance constraints. By using richer feedback factors, e.g., the I/O wait, MAR is able to accurately describe the relationships among core frequencies, performance, and power consumption. We adopt a modeless control model to reduce the complexity of system modeling. MAR is designed for CMP (Chip Multi-Processor) systems by employing multi-input/multi-output (MIMO) theory and per-core DVFS (Dynamic Voltage and Frequency Scaling). Our extensive experiments on a physical test bed demonstrate that, for the SPEC benchmark and the data-intensive (TPC-C) benchmark, MAR is 93.6-96.2% accurate with respect to the ideal power-saving strategy calculated off-line. Compared with baseline solutions, MAR saves 22.5-32.5% more power while keeping a comparable performance loss of about 1.8-2.9%. In addition, simulation results show the efficiency of our design for various CMP configurations.
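To illustrate why the I/O-wait phase matters for frequency scaling, below is a toy rule-based frequency selector in Python. The thresholds, the available frequency list, and the decision rules are illustrative assumptions; this is not the MAR controller itself, which uses MIMO control theory and per-core DVFS.

```python
def pick_frequency(busy, iowait, freqs, busy_hi=0.70, busy_lo=0.30, iowait_hi=0.40):
    """Illustrative rule-based frequency choice driven by busy and I/O-wait ratios.

    busy, iowait: fractions of the last sampling interval, each in [0, 1].
    freqs: available frequencies, sorted ascending (e.g. in GHz).
    """
    if iowait >= iowait_hi:
        return freqs[0]            # completion time bounded by the I/O subsystem
    if busy >= busy_hi:
        return freqs[-1]           # CPU-bound: run at the highest frequency
    if busy <= busy_lo:
        return freqs[0]            # mostly idle: lowest frequency saves power
    # in between: scale frequency roughly in proportion to the busy ratio
    idx = round(busy * (len(freqs) - 1))
    return freqs[idx]

# Example: a core spending 20% busy and 60% waiting on disk I/O is clocked down.
print(pick_frequency(busy=0.20, iowait=0.60, freqs=[1.2, 1.8, 2.4, 3.0]))  # -> 1.2
```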
Citations: 12
Parallel Metagenomic Sequence Clustering Via Sketching and Maximal Quasi-clique Enumeration on Map-Reduce Clouds
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.116
X. Yang, J. Zola, S. Aluru
Taxonomic clustering of species is an important and frequently arising problem in metagenomics. High-throughput next generation sequencing is facilitating the creation of large metagenomic samples, while at the same time making the clustering problem harder due to the short sequence length supported and unknown species sampled. In this paper, we present a parallel algorithm for hierarchical taxonomic clustering of large metagenomic samples with support for overlapping clusters. We adapt the sketching techniques originally developed for web document clustering to deduce significant similarities between pairs of sequences without resorting to expensive all vs. all alignments. We formulate the metagenomics classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud based implementation. Apart from solving an important problem in metagenomics, this work demonstrates the applicability of map-reduce framework in relatively complicated algorithmic settings.
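The sketching step referred to above originates in web-document similarity estimation (shingling with min-wise hashing); the Python sketch below shows the flavor on DNA k-mers. The k-mer length, sketch size, hash function, and similarity threshold are illustrative assumptions, not the parameters or implementation used in the paper.

```python
import hashlib

def kmer_hashes(seq, k=16):
    """Hash every k-mer of a read to an integer (md5 used only for illustration)."""
    return {int(hashlib.md5(seq[i:i + k].encode()).hexdigest(), 16)
            for i in range(len(seq) - k + 1)}

def bottom_sketch(seq, k=16, s=64):
    """Keep the s smallest k-mer hashes as a fixed-size sketch of the read."""
    return set(sorted(kmer_hashes(seq, k))[:s])

def resemblance(sketch_a, sketch_b, s=64):
    """Bottom-s estimate of the Jaccard similarity between the two k-mer sets."""
    smallest = sorted(sketch_a | sketch_b)[:s]
    hits = sum(1 for h in smallest if h in sketch_a and h in sketch_b)
    return hits / len(smallest)

# Two reads sharing a long substring get a high estimated similarity; an edge
# would be added to the similarity graph if it exceeds a chosen threshold.
a = "ACGTACGTTTGACCAGGTACCGTAGGCTAAGCTTACGGATCCGTACGATCG"
b = "ACGTACGTTTGACCAGGTACCGTAGGCTAAGCTTACGGATCCGTACGTTTT"
print(resemblance(bottom_sketch(a), bottom_sketch(b)))
```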
Citations: 23
Tolerant Value Speculation in Coarse-Grain Streaming Computations
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.54
Nathaniel Azuelos, I. Keidar, A. Zaks
Streaming applications are the subject of growing interest, as the need for fast access to data continues to grow. In this work, we present the design requirements and implementation of coarse-grain value speculation in streaming applications. We explain how this technique can be useful in cases where serial parts of applications constitute bottlenecks, and when slower I/O favors using available prefixes of the data. Contrary to previous work, we show how allowing some tolerance can justify early predictions over a large window of values. We suggest a methodology for runtime support of speculation, along with the mechanisms required for rollback. We present the resource management issues that arise from our technique. We study how validation and speculation frequencies impact the performance of the program. Finally, we present our implementation in the context of the Huffman encoder benchmark, running it in different configurations and on different architectures.
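As a concrete picture of "tolerant" value speculation, the toy Python simulation below applies an accept-or-rollback rule sequentially. The predictor, the tolerance check, and the two-stage pipeline shape are illustrative assumptions, not the streaming runtime mechanism implemented in the paper.

```python
def speculative_pipeline(inputs, slow_producer, consumer, predictor, tolerance):
    """Toy sequential simulation of tolerant value speculation.

    For each input, the consumer starts from a *predicted* producer value.
    Once the real value is known, the speculative result is kept only if
    the prediction was within `tolerance`; otherwise we roll back and
    recompute. Returns (results, number_of_rollbacks).
    """
    results, rollbacks, history = [], 0, []
    for x in inputs:
        guess = predictor(history)            # speculate on the producer's output
        speculative = consumer(guess)
        actual = slow_producer(x)             # the value we were waiting for
        history.append(actual)
        if abs(actual - guess) <= tolerance:
            results.append(speculative)       # tolerant check: keep speculative work
        else:
            rollbacks += 1
            results.append(consumer(actual))  # rollback: redo with the real value
    return results, rollbacks

# Example: predictor repeats the last observed value, tolerance of 0.5.
res, rb = speculative_pipeline(
    inputs=range(5),
    slow_producer=lambda x: x * 1.0,
    consumer=lambda v: v * v,
    predictor=lambda h: h[-1] if h else 0.0,
    tolerance=0.5,
)
print(res, rb)
```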
Citations: 1
LACIO: A New Collective I/O Strategy for Parallel I/O Systems
Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.79
Yong Chen, Xian-He Sun, R. Thakur, P. Roth, W. Gropp
Parallel applications benefit considerably from the rapid advance of processor architectures and the available massive computational capability, but their performance suffers from the large latency of I/O accesses. Poor I/O performance has been identified as a critical cause of the low sustained performance of parallel systems. Collective I/O is widely considered a critical solution that exploits the correlation among I/O accesses from multiple processes of a parallel application and optimizes I/O performance. However, the conventional collective I/O strategy makes the optimization decision based on the logical file layout, to avoid multiple file system calls, and does not take the physical data layout into consideration. On the other hand, the physical data layout in fact decides the actual I/O access locality and concurrency. In this study, we propose a new collective I/O strategy that is aware of the underlying physical data layout. We confirm that the new Layout-Aware Collective I/O (LACIO) improves the performance of current parallel I/O systems effectively with the help of noncontiguous file system calls. It holds promise for improving the I/O performance of parallel systems.
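To illustrate what "layout-aware" means here, the Python sketch below regroups the byte ranges requested by different processes according to the storage server that physically holds each stripe, assuming a simple round-robin striping model. The striping scheme and parameters are illustrative assumptions; this is not LACIO's actual assignment policy.

```python
from collections import defaultdict

def assign_to_aggregators(requests, stripe_size, n_servers):
    """Toy illustration of layout-aware request aggregation.

    requests: list of (process_id, file_offset, length) in the logical file.
    Each byte range is split at stripe boundaries and mapped to the storage
    server that physically holds it, so one aggregator can issue grouped
    requests per server instead of scattering them by logical offset.
    """
    per_server = defaultdict(list)
    for pid, off, length in requests:
        end = off + length
        while off < end:
            server = (off // stripe_size) % n_servers          # physical placement
            chunk_end = min(end, (off // stripe_size + 1) * stripe_size)
            per_server[server].append((pid, off, chunk_end - off))
            off = chunk_end
    return dict(per_server)

# Example: two 1 MiB requests striped over 4 servers in 64 KiB stripes.
groups = assign_to_aggregators([(0, 0, 1 << 20), (1, 1 << 20, 1 << 20)],
                               stripe_size=64 * 1024, n_servers=4)
print({server: len(chunks) for server, chunks in groups.items()})  # {0: 8, 1: 8, 2: 8, 3: 8}
```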
Citations: 41