OS-Based NUMA Optimization: Tackling the Case of Truly Multi-thread Applications with Non-partitioned Virtual Page Accesses
Ilaria Di Gennaro, Alessandro Pellegrini, F. Quaglia
A common approach to improving memory access performance on NUMA machines exploits operating system (OS) page-protection mechanisms to induce faults and thereby determine which pages are accessed by which thread, so that a thread and its working set of pages can be moved to the same NUMA node. However, existing proposals do not fully fit the requirements of truly multi-thread applications with non-partitioned accesses to virtual pages. These proposals observe (induced) faults on a single page table shared by all the threads of a process to determine the access pattern. Hence, a fault by one thread (and the consequent re-opening of access to the corresponding page) masks the accesses by other threads to the same page. This can make the estimated working set of individual threads inaccurate. We overcome this drawback by presenting a lightweight operating system support for Linux, referred to as multi-view address space, which explicitly targets accurate per-thread working-set estimation in truly multi-thread applications with non-partitioned accesses, together with an associated thread/data migration policy. Our solution is fully transparent to user-space code. It is embedded in a Linux/x86_64 module that installs all required modifications to the original kernel image solely through dynamic patching. A motivated case study in the context of HPC is also presented to assess our proposal.
{"title":"OS-Based NUMA Optimization: Tackling the Case of Truly Multi-thread Applications with Non-partitioned Virtual Page Accesses","authors":"Ilaria Di Gennaro, Alessandro Pellegrini, F. Quaglia","doi":"10.1109/CCGrid.2016.91","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.91","url":null,"abstract":"A common approach to improve memory access in NUMA machines exploits operating system (OS) page protection mechanisms to induce faults to determine which pages are accessed by what thread, so as to move the thread and its working-set of pages to the same NUMA node. However, existing proposals do not fully fit the requirements of truly multi-thread applications with non-partitioned accesses to virtual pages. In fact, these proposals exploit (induced) faults on a same page-table for all the threads of a same process to determine the access pattern. Hence, the fault by one thread (and the consequent re-opening of the access to the corresponding page) would mask those by other threads on the same page. This may lead to inaccuracy in the estimation of the working-set of individual threads. We overcome this drawback by presenting a lightweight operating system support for Linux, referred to as multi-view address space, explicitly targeting accuracy of per-thread working-set estimation in truly multi-thread applications with non-partitioned accesses, and an associated thread/data migration policy. Our solution is fully transparent to user-space code. It is embedded in a Linux/x86_64 module that installs any required modification to the original kernel image by solely relying on dynamic patching. A motivated case study in the context of HPC is also presented for an assessment of our proposal.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122272154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Parallel Stream Compaction for IA-Based Multi/many-core Processors
Qiao Sun, Chao Yang, Changmao Wu, Leisheng Li, Fangfang Liu
Stream compaction, frequently found in a large variety of applications, is a general primitive that reduces an input stream to the subset containing only the wanted elements, so that follow-on computation can be done efficiently. In this paper, we propose a fast parallel stream compaction algorithm for IA-based multi-/many-core processors. Unlike previously studied algorithms, which depend heavily on a black-box parallel scan, we open the black box and manually tailor the scan so that both the workload and the memory footprint are significantly reduced. By further eliminating conditional statements and applying automatic code generation/optimization to the performance-critical kernels, the proposed parallel stream compaction achieves high performance across different cases, data types, and IA-based multi-/many-core platforms. Experimental results on three typical IA-based processors (a quad-core Core i7 CPU, a dual-socket 8-core Xeon CPU, and a 61-core Xeon Phi accelerator) show that the proposed implementation outperforms the reference parallel counterpart in the state-of-the-art library Thrust. On top of that, we apply it to a random-forest-based data classifier to show its potential for boosting the performance of real-world applications.
{"title":"Fast Parallel Stream Compaction for IA-Based Multi/many-core Processors","authors":"Qiao Sun, Chao Yang, Changmao Wu, Leisheng Li, Fangfang Liu","doi":"10.1109/CCGrid.2016.112","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.112","url":null,"abstract":"Stream compaction, frequently found in a large variety of applications, serves as a general primitive that reduces an input stream to a subset containing only the wanted elements so that the follow-on computation can be done efficiently. In this paper, we propose a fast parallel stream compaction for IA-based multi-/many-core processors. Unlike the previously studied algorithms that depend heavily on a black-box parallel scan, we open the black-box in the proposed algorithm and manually tailor it so that both the workload and the memory footprint is significantly reduced. By further eliminating the conditional statements and applying automatic code generation/optimization for performance-critical kernels, the proposed parallel stream compaction achieves high performance in different cases and for various data types across different IA-based multi/manycore platforms. Experimental results on three typical IA-based processors, including a quad-core Core-i7 CPU, a dual-socket 8-core Xeon CPU, and a 61-core Xeon Phi accelerator show that the proposed implementation outperforms the referenced parallel counterpart in the state-of-art library Thrust. On top of the above, we apply it in the random forest based data classifier to show its potential to boost the performance of real-world applications.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117274171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iGiraph: A Cost-Efficient Framework for Processing Large-Scale Graphs on Public Clouds
Safiollah Heidari, R. Calheiros, R. Buyya
Large-scale graph analytics has gained attention during the past few years. As the world becomes more connected through new technologies and applications such as social networks, Web portals, mobile devices, and the Internet of Things, a huge amount of data is created and stored every day in the form of graphs consisting of billions of vertices and edges. Many graph processing frameworks have been developed to process these large graphs since Google introduced its Pregel framework in 2010. Meanwhile, cloud computing has improved service delivery by overcoming the restrictions of traditional computing through distributed computing, elasticity, and pay-as-you-go pricing. In this paper, we present iGiraph, a cost-efficient Pregel-like graph processing framework for processing large-scale graphs on public clouds. iGiraph uses a new dynamic re-partitioning approach based on messaging patterns to minimize the cost of resource utilization on public clouds. We also present experimental results on the performance and cost effects of our method and compare them with the basic Giraph framework. Our results show that iGiraph markedly decreases cost and improves performance by dynamically scaling the number of workers.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.38
CACPPA: A Cloud-Assisted Conditional Privacy Preserving Authentication Protocol for VANET
Ubaidullah Rajput, Fizza Abbas, Jian Wang, Hasoo Eun, Heekuck Oh
A vehicular ad hoc network (VANET) is an application of intelligent transportation systems (ITS) with emphasis on improving both traffic safety and efficiency. A VANET can be thought of as a subset of a mobile ad hoc network (MANET) in which vehicles form a network by communicating with each other (V2V) or with infrastructure (V2I). Vehicles broadcast not only traffic messages but also safety-critical messages such as the electronic emergency braking light (EEBL). Misuse of this application may result in a traffic accident and, at worst, loss of life. This makes vehicle authentication a necessary requirement in VANETs. During authentication, privacy-related data, such as the vehicle's and owner's identity and location information, must be kept private to prevent an attacker from stealing it. This paper presents a cloud-assisted conditional privacy preserving authentication (CACPPA) protocol for VANETs. CACPPA is a hybrid approach that draws on both pseudonym-based and group-signature-based approaches while avoiding their inherent drawbacks: it requires a vehicle neither to manage a certificate revocation list nor to manage any groups. Instead, an efficient cloud-based certification authority assists vehicles in obtaining credentials and subsequently using them during authentication. CACPPA provides conditional anonymity: a vehicle's anonymity is preserved only as long as it honestly follows the protocol. Furthermore, we analyze CACPPA under various attack scenarios and present a computational and communication cost analysis, as well as a comparison with existing approaches, to show its feasibility and robustness.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.47
Evaluation of In-Situ Analysis Strategies at Scale for Power Efficiency and Scalability
I. Rodero, M. Parashar, Aaditya G. Landge, Sidharth Kumar, Valerio Pascucci, P. Bremer
The increasing gap between available compute power and I/O capabilities is forcing simulation pipelines running on leadership computing facilities to be reformulated. In particular, in-situ processing is complementing conventional post-process analysis; it can be performed either on the same compute resources as the simulation or on secondary dedicated resources. In this paper, we focus on three in-situ analysis strategies that use the same compute resources as the ongoing simulation but differ in their data movement. We evaluate the costs incurred by these strategies in terms of run time, scalability, and power/energy consumption. Furthermore, we extrapolate power behavior to petascale and investigate different design choices through projections. Experimental evaluation at full machine scale on Titan indicates that using fewer cores per node for in-situ analysis is the best choice in terms of scalability. Hence, further research effort should be devoted to developing in-situ analysis techniques that follow this strategy on future high-end systems.
{"title":"Evaluation of In-Situ Analysis Strategies at Scale for Power Efficiency and Scalability","authors":"I. Rodero, M. Parashar, Aaditya G. Landge, Sidharth Kumar, Valerio Pascucci, P. Bremer","doi":"10.1109/CCGrid.2016.95","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.95","url":null,"abstract":"The increasing gap between available compute power and I/O capabilities is resulting in simulation pipelines running on leadership computing facilities being reformulated. In particular, in-situ processing is complementing conventional post-process analysis, however, it can be performed by using the same compute resources as the simulation or using secondary dedicated resources. In this paper, we focus on three different in-situ analysis strategies, which use the same compute resources as the ongoing simulation but different data movement strategies. We evaluate the costs incurred by these strategies in terms of run time, scalability and power/energy consumption. Furthermore, we extrapolate power behavior to peta-scale and investigate different design choices through projections. Experimental evaluation at full machine scale on Titan supports that using fewer cores per node for in-situ analysis is the optimum choice in terms of scalability. Hence, further research effort should be devoted towards developing in-situ analysis techniques following this strategy in future high-end systems.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122469522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Billing system CPU time on individual VM
Boris Teabe, A. Tchana, D. Hagimont
In virtualized cloud hosting centers, a virtual machine (VM) is generally allocated a fixed computing capacity, and the virtualization system schedules the VMs so that each VM's capacity is provided and respected. However, a significant amount of CPU time is consumed by the underlying virtualization system itself, which generally includes device drivers (mainly network and disk drivers). In today's virtualization systems, this CPU time is difficult to monitor and is not charged to VMs. This situation has important consequences for both clients and the provider: degraded performance isolation and predictability for the former, and harder resource management (especially consolidation) for the latter. In this paper, we propose a virtualization system mechanism that estimates the CPU time used by the virtualization system on behalf of each VM. This CPU time is then charged to the VM, removing the two side effects above. The mechanism has been implemented in Xen, and its benefits have been evaluated using reference benchmarks.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.76
Demand-Aware Power Management for Power-Constrained HPC Systems
Cao Thang, Yuan He, Masaaki Kondo
As a limited power budget becomes one of the most crucial challenges in developing supercomputer systems, hardware overprovisioning, which installs more nodes than can run simultaneously at the power constraint determined by Thermal Design Power, is an attractive way to design extreme-scale supercomputers. In this design, the power consumption of each node must be controlled through hardware power knobs such as dynamic voltage and frequency scaling (DVFS) or power-capping mechanisms. Traditionally, supercomputer schedulers decide only when and where to allocate jobs; in overprovisioned systems, they must also decide how much power to allocate to each job. An easy approach is to set a fixed power cap for each job so that total consumption stays within the system's power constraint. Fixed power capping does not necessarily provide good performance, however, since a job's effective power usage changes throughout its execution; moreover, because each job has its own performance requirement, a fixed cap may not suit all jobs. In this paper, we propose a demand-aware power management framework for overprovisioned and power-constrained high-performance computing (HPC) systems. The job scheduler selects jobs to run based on the available hardware and power resources. The power manager continuously monitors power usage, predicts the performance of executing jobs, and optimizes the power cap of each CPU so that each job's required performance level is satisfied while system throughput improves through better use of the available power budget. Experiments on a real HPC system, together with simulations of a large-scale system, show that the power manager successfully controls the power consumption of executing jobs while achieving a 1.17x improvement in system throughput.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.25
HPC Job Mapping over Reconfigurable Wireless Links
Yao Hu, I. Fujiwara, M. Koibuchi
Wireless supercomputers and datacenters with 60GHz radio or free-space optics (FSO) have been proposed so that diverse application workloads can be better supported by changing the network topology, that is, by swapping the endpoints of wireless links. In this study, we propose using such wireless links to improve job mapping. We investigate the trade-offs among the number of wireless links, the time overhead of wireless link reconfiguration, topology embedding, and job sizes. Our simulation results demonstrate that wired job mapping heavily degrades the system utilization of supercomputers and datacenters under a conventional fixed network topology. By contrast, wireless interconnection networks allow an ideal job mapping by directly reconnecting non-neighboring computing nodes. This improves system utilization by up to 17.7% for user jobs on a supercomputer and can thus shorten the overall service time, especially under bursts of dozens of incoming jobs. Furthermore, we confirm that neither the workload nor the scheduling policy changes the finding that ideal job mapping on wireless supercomputers outperforms that on wired networks in terms of system utilization and overall service time. Finally, our evaluation shows that a constrained, moderate additional use of wireless links achieves shorter queue lengths and queuing times.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.17
DiBA: Distributed Power Budget Allocation for Large-Scale Computing Clusters
Masoud Badiei, Xin Zhan, R. Azimi, S. Reda, Na Li
Power management has become a central issue in large-scale computing clusters, where a considerable amount of energy is consumed and a large operational cost is incurred annually. Traditional power management techniques have a centralized design that limits the scalability of computing clusters. In this work, we develop a framework for distributed power budget allocation that maximizes the utility of computing nodes subject to a total power budget constraint. To eliminate the central coordinator of the primal-dual technique, we propose a distributed power budget allocation algorithm (DiBA) that maximizes the combined performance of a cluster subject to a power budget constraint in a distributed fashion. Specifically, DiBA is a consensus-based algorithm in which each server determines its optimal power consumption locally by communicating its state with its neighbors (connected nodes) in the cluster. We characterize a synchronous primal-dual technique to obtain a benchmark for comparison with our distributed algorithm. We demonstrate numerically that DiBA is scalable and outperforms the conventional primal-dual method on large-scale clusters in terms of convergence time, while also eliminating the communication bottleneck of the primal-dual method. We thoroughly evaluate DiBA's characteristics through simulations of large-scale clusters, and we provide results from a proof-of-concept implementation on a real experimental cluster.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.101
DTStorage: Dynamic Tape-Based Storage for Cost-Effective and Highly-Available Streaming Service
Jaewon Lee, Jaehyung Ahn, Choongul Park, Jangwoo Kim
An online streaming service is a gateway providing highly accessible data to end users. However, the rapid growth of digital data and of the corresponding user accesses increases the storage costs and management burden of service providers. In particular, the accumulation of rarely accessed cold data, together with time-varying and skewed data accesses, are the two major problems degrading both the efficiency and the throughput of modern streaming services. In this paper, we propose DTStorage, a dynamic tape-based storage system for cost-effective and highly available streaming services. DTStorage significantly reduces storage costs by keeping latency-insensitive cold data on cost-effective tape storage, and achieves high throughput by adaptively balancing data availability with its contention-aware replica management policy. Our prototype, evaluated in collaboration with KT, the largest multimedia service company in Korea, reduces storage costs by up to 45% while satisfying the target performance of a real-world smart TV streaming workload.
{"title":"DTStorage: Dynamic Tape-Based Storage for Cost-Effective and Highly-Available Streaming Service","authors":"Jaewon Lee, Jaehyung Ahn, Choongul Park, Jangwoo Kim","doi":"10.1109/CCGrid.2016.43","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.43","url":null,"abstract":"An online streaming service is a gateway providing highly accessible data to end-users. However, the rapid growth of digital data and the corresponding user accesses increase the storage costs and management burden of the service providers. In particular, the accumulation of rarely accessed cold data and the time-varying and skewed data accesses are the two major problems degrading the efficiency as well as throughput of modern streaming services. In this paper, we propose DTStorage, a dynamic tape-based storage system for cost-effective and highly-available streaming services. DTStorage significantly reduces the storage costs by keeping latency-insensitive cold data in cost-effective tape storages, and achieves high throughput by adaptively balancing the data availability with its contention-aware replica management policy. Our prototype evaluated in collaboration with KT, the largest multimedia service company in Korea, reduces the storage costs by up to 45% while satisfying the target performance of a real-world smart TV streaming workload.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114765824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}