
2010 11th IEEE/ACM International Conference on Grid Computing: Latest Publications

Reliable workflow execution in distributed systems for cost efficiency
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5697959
Young Choon Lee, Albert Y. Zomaya, Mazin S. Yousif
Reliability is of great practical importance in distributed computing systems (DCSs) due to its immediate impact on system performance, i.e., quality of service. The issue of reliability becomes even more crucial for ‘cost-conscious’ DCSs like grids and clouds, since unreliability brings about additional, often excessive, capital and operating costs. Resource failures are considered the main source of unreliability in this study. We investigate the reliability of workflow execution in the context of scheduling and its effect on operating costs in DCSs, and present the reliability for profit assurance (RPA) algorithm as a novel workflow scheduling heuristic. The proposed RPA algorithm incorporates an (operating) cost-aware replication scheme to increase reliability, and this cost awareness greatly contributes to replication decisions that are efficient in terms of profitability. To the best of our knowledge, the work in this paper is the first attempt to explicitly take (monetary) reliability cost into account in workflow scheduling.
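The sketch below is only an illustration of the kind of trade-off the abstract describes, replicating a task when the expected monetary loss from a failure outweighs the extra operating cost of a replica; the function name, parameters and numbers are hypothetical and do not reproduce the authors' RPA heuristic.

```python
# Illustrative sketch only: a cost-aware replication decision under a simple
# expected-cost model. All names and numbers are hypothetical; this is not
# the paper's RPA algorithm.

def should_replicate(failure_prob: float,
                     rerun_cost: float,
                     sla_penalty: float,
                     replica_cost: float) -> bool:
    """Replicate a task if the expected loss without a replica exceeds
    the extra operating cost of running one."""
    expected_loss = failure_prob * (rerun_cost + sla_penalty)
    return expected_loss > replica_cost

# Example: a task on an unreliable resource with a high SLA penalty.
if __name__ == "__main__":
    print(should_replicate(failure_prob=0.15, rerun_cost=2.0,
                           sla_penalty=20.0, replica_cost=2.5))  # True
```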
Citations: 10
Metrics and task scheduling policies for energy saving in multicore computers
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5697984
J. Mair, K. Leung, Z. Huang
In this paper, we propose three new metrics, Speedup per Watt (SPW), Power per Speedup (PPS) and Energy per Target (EPT), to guide task schedulers in selecting the best task schedules for energy saving in multicore computers. Based on these metrics, we propose the novel Sharing, Hare and Tortoise Policies, which take parallelism and Dynamic Voltage Frequency Scaling (DVFS) into account in their schedules. Our experiments show that, on a modern multicore computer, the Hare Policy can save up to 72% of energy in a system with low utilization, while on a busier system the Sharing Policy can save up to 20% of energy over standard scheduling policies.
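The abstract names the metrics but not their formulas, so the sketch below simply reads the names literally: Speedup per Watt as speedup divided by power, Power per Speedup as its inverse, and Energy per Target as the energy spent reaching a fixed amount of work. These definitions are assumptions, not necessarily the paper's.

```python
# Hedged sketch: metric definitions inferred from the names only; the paper
# may define SPW, PPS and EPT differently.

def speedup_per_watt(t_serial: float, t_parallel: float, power_watts: float) -> float:
    speedup = t_serial / t_parallel
    return speedup / power_watts           # SPW

def power_per_speedup(t_serial: float, t_parallel: float, power_watts: float) -> float:
    speedup = t_serial / t_parallel
    return power_watts / speedup           # PPS

def energy_per_target(power_watts: float, time_to_target_s: float) -> float:
    return power_watts * time_to_target_s  # EPT, in joules

# Example: 4x speedup at 120 W, target workload finished in 50 s.
print(speedup_per_watt(100.0, 25.0, 120.0))   # ~0.033 speedup per watt
print(power_per_speedup(100.0, 25.0, 120.0))  # 30.0 W per unit of speedup
print(energy_per_target(120.0, 50.0))         # 6000 J
```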
Citations: 19
Parallel simulation and visualization of blood flow in intracranial aneurysms
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5697965
Wolfgang Fenz, J. Dirnberger, C. Watzl, M. Krieger
Our aim is to develop a physically correct simulation of blood flow through intracranial aneurysms. It shall provide means to estimate rupture risks by calculating the distribution of pressure and shear stresses in an intracranial aneurysm, in order to support the planning of clinical interventions. Due to the time-critical nature of the application, we are forced to use the most efficient state-of-the-art numerical methods and technologies together with high performance computing (HPC) infrastructures. The Navier-Stokes equations for the blood flow are discretized via the finite element method (FEM), and the resulting linear equation systems are handled by an algebraic multigrid (AMG) solver. First comparisons of our simulation results with commercial CFD (computational fluid dynamics) software already show good medical relevance for diagnostic decision support. Another challenge is the visualization of our simulation results at acceptable interaction response rates. Physicians require quick and highly interactive visualization of velocity, pressure and stress to be able to assess the rupture risk of an individual vessel morphology. To meet these demands, parallel visualization techniques and high performance computing resources are utilized. In order to provide physicians with access to remote HPC resources which are not available at every hospital, computing infrastructure of the Austrian Grid is utilized for simulation and visualization.
Citations: 2
SLA compliance monitoring through semantic processing
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5697975
L. Coppolino, D. Mari, L. Romano, V. Vianello
For IT-service providers, user satisfaction is the key to their company's success. Service providers need to understand the requirements of their users and translate them into their own business goals. Service malfunctions can have a negative impact on user satisfaction, so detecting and resolving failures at the business process level has become a mission-critical requirement for any IT company. Unfortunately, even if a failure manifests itself at the business level, the data describing this failure are scattered across low-level components of the system and stored with a formalism incomprehensible to any business analyst. In forensic analysis, the semantic gap between collected data and business analysts' knowledge is closed by adopting data-mining and data-warehousing techniques, but such techniques are unsuitable for real-time business process analysis due to their long latencies. The purpose of this paper is to present a framework that allows business process analysts to investigate the delivery status of business services in near real time. The framework requires a first set-up phase in which domain specialists define ontologies describing low-level concepts and the mapping between business events and the data gathered into the system; it then provides business process analysts, aware only of business logic, with a way to investigate service delivery status in near real time. The framework's ability to process data in near real time is ensured by emerging technologies such as complex event processing (CEP) engines, which can process huge amounts of data in real time. The paper also presents a case study from the telecommunications industry that demonstrates the applicability of the framework in a real-world scenario.
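A minimal stand-in for the CEP-style rule the abstract relies on, checking an SLA threshold over a sliding window of low-level events, could look like the sketch below; the event fields, window size and threshold are invented for illustration and are not the paper's framework.

```python
# Minimal sketch of a sliding-window SLA check, standing in for a CEP rule.
# Event fields, window length and threshold are hypothetical.
from collections import deque

class SlaMonitor:
    def __init__(self, window: int = 100, max_avg_latency_ms: float = 200.0):
        self.latencies = deque(maxlen=window)
        self.max_avg_latency_ms = max_avg_latency_ms

    def on_event(self, latency_ms: float) -> bool:
        """Feed one low-level event; return True if the SLA is currently violated."""
        self.latencies.append(latency_ms)
        avg = sum(self.latencies) / len(self.latencies)
        return avg > self.max_avg_latency_ms

monitor = SlaMonitor(window=5, max_avg_latency_ms=150.0)
for latency in [90, 120, 180, 300, 400]:
    if monitor.on_event(latency):
        print("SLA violation detected after event with latency", latency)
```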
Citations: 4
Supporting multi-row distributed transactions with global snapshot isolation using bare-bones HBase
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5697970
Chen Zhang, H. Sterck
Snapshot isolation (SI) is an important database transactional isolation level adopted by major database management systems (DBMSs). Until now, there has been no solution that allows a traditional DBMS to be easily replicated with global SI for distributed transactions in cloud computing environments. HBase is a column-oriented data store for Hadoop that has been proven to scale and perform well on clouds, with random access performance on par with open-source DBMSs such as MySQL. However, HBase only provides single atomic row writes based on row locks and very limited transactional support. In this paper, we show how multi-row distributed transactions with a global SI guarantee can be easily supported using bare-bones HBase in its default configuration, so that the high throughput, scalability, fault tolerance, access transparency and easy deployability of HBase are inherited. Through performance studies, we quantify the cost of adopting our technique. The contribution of this paper is a novel approach that uses HBase as a cloud database solution with global SI at low added cost. Our approach can be easily extended to other column-oriented data stores.
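The abstract does not spell out the client-side protocol, but the SI rules it builds on (reads from a snapshot taken at transaction start, first-committer-wins on write-write conflicts) can be sketched against a generic multi-versioned store. This is a textbook-style illustration, not the paper's HBase-based mechanism and not the HBase API.

```python
# Textbook-style snapshot isolation sketch over an in-memory multi-version
# store; it only illustrates the SI rules, not the paper's HBase design.

class MVStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), ascending ts
        self.clock = 0

    def next_ts(self) -> int:
        self.clock += 1
        return self.clock

class Transaction:
    def __init__(self, store: MVStore):
        self.store = store
        self.start_ts = store.next_ts()   # snapshot timestamp
        self.writes = {}

    def read(self, key):
        if key in self.writes:            # read-your-own-writes
            return self.writes[key]
        # Newest version committed at or before this transaction's snapshot.
        for ts, value in reversed(self.store.versions.get(key, [])):
            if ts <= self.start_ts:
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self) -> bool:
        # First-committer-wins: abort if any written key gained a newer version.
        for key in self.writes:
            versions = self.store.versions.get(key, [])
            if versions and versions[-1][0] > self.start_ts:
                return False              # write-write conflict -> abort
        commit_ts = self.store.next_ts()
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
        return True

store = MVStore()
t0 = Transaction(store); t0.write("balance", 100); t0.commit()
t1, t2 = Transaction(store), Transaction(store)
t1.write("balance", t1.read("balance") - 30)
t2.write("balance", t2.read("balance") - 50)
print(t1.commit())  # True
print(t2.commit())  # False: write-write conflict, t2 must retry
```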
Citations: 57
Connecting arbitrary data resources to the grid
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5697958
Shunde Zhang, P. Coddington, A. Wendelborn
Many scientific grid systems have been running and serving researchers around the world for many years. Among them, the Globus Toolkit and its variants play an important role as the basis of most existing grid systems. However, the way data is stored and accessed varies: proprietary protocols have been designed and developed to serve data from different storage systems or file systems. One example is the integrated Rule Oriented Data System (iRODS), a data grid system with the non-standard iRODS protocol and its own client tools and API. Consequently, it is difficult for the grid to connect to it directly and stage data to computers in the grid for processing. Transferring data between two data systems with different protocols is usually an ad hoc process. In addition, existing data transfer services are mostly designed for the grid and do not understand proprietary protocols. This requires users to transfer data from the source to a temporary space, and then from the temporary space to the destination, which is complex, inefficient and error-prone. Some work has been done on the client side to address this issue. In order to address data staging and data transfer in one solution, this paper describes a different but easy and generic approach to connect any data system to the grid, by providing a service with an abstract framework that converts any underlying data system protocol to the GridFTP protocol, the de facto standard for data transfer on the grid.
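The "abstract framework" idea, a single front-end transfer protocol with pluggable back-end drivers, can be sketched as a minimal adapter interface. All class and method names below are invented for illustration; they are not the GridFTP, Globus or iRODS APIs.

```python
# Hedged sketch of a pluggable back-end adapter behind one uniform front-end
# data-transfer protocol. Interfaces are invented; they are not the actual
# GridFTP, Globus or iRODS APIs.
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Operations any storage system must expose to the front end."""
    @abstractmethod
    def list(self, path: str) -> list: ...
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class InMemoryBackend(StorageBackend):
    """Stand-in back end; a real driver would speak e.g. the iRODS protocol."""
    def __init__(self):
        self.files = {}
    def list(self, path):
        return [p for p in self.files if p.startswith(path)]
    def read(self, path):
        return self.files[path]
    def write(self, path, data):
        self.files[path] = data

class FrontEndServer:
    """Front end exposing one uniform transfer protocol over any back end."""
    def __init__(self, backend: StorageBackend):
        self.backend = backend
    def handle_retrieve(self, path: str) -> bytes:
        return self.backend.read(path)
    def handle_store(self, path: str, data: bytes) -> None:
        self.backend.write(path, data)

server = FrontEndServer(InMemoryBackend())
server.handle_store("/data/sample.txt", b"hello grid")
print(server.handle_retrieve("/data/sample.txt"))
```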
Citations: 4
Analysis and modeling of time-correlated failures in large-scale distributed systems
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5697961
N. Yigitbasi, M. Gallet, Derrick Kondo, A. Iosup, D. Epema
The analysis and modeling of the failures bound to occur in today's large-scale production systems is invaluable in providing the understanding needed to make these systems fault-tolerant yet efficient. Many previous studies have modeled failures without taking into account the time-varying behavior of failures, under the assumption that failures are identically, but independently distributed. However, the presence of time correlations between failures (such as peak periods with increased failure rate) refutes this assumption and can have a significant impact on the effectiveness of fault-tolerance mechanisms. For example, the performance of a proactive fault-tolerance mechanism is more effective if the failures are periodic or predictable; similarly, the performance of checkpointing, redundancy, and scheduling solutions depends on the frequency of failures. In this study we analyze and model the time-varying behavior of failures in large-scale distributed systems. Our study is based on nineteen failure traces obtained from (mostly) production large-scale distributed systems, including grids, P2P systems, DNS servers, web servers, and desktop grids. We first investigate the time correlation of failures, and find that many of the studied traces exhibit strong daily patterns and high autocorrelation. Then, we derive a model that focuses on the peak failure periods occurring in real large-scale distributed systems. Our model characterizes the duration of peaks, the peak inter-arrival time, the inter-arrival time of failures during the peaks, and the duration of failures during peaks; we determine for each the best-fitting probability distribution from a set of several candidate distributions, and present the parameters of the (best) fit. Last, we validate our model against the nineteen real failure traces, and find that the failures it characterizes are responsible on average for over 50% and up to 95% of the downtime of these systems.
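The modelling step described here, fitting several candidate distributions to, for example, failure inter-arrival times and keeping the best fit, might look roughly like the SciPy sketch below; the synthetic data, candidate set and goodness-of-fit criterion are illustrative choices, not necessarily those used in the paper.

```python
# Sketch: fit candidate distributions to failure inter-arrival times and pick
# the best by the Kolmogorov-Smirnov statistic. Data and candidates are
# illustrative only; real traces and criteria may differ.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
inter_arrivals = rng.weibull(0.7, size=1000) * 3600.0  # synthetic data, seconds

candidates = {
    "expon": stats.expon,
    "weibull_min": stats.weibull_min,
    "lognorm": stats.lognorm,
    "gamma": stats.gamma,
}

best_name, best_ks, best_params = None, np.inf, None
for name, dist in candidates.items():
    params = dist.fit(inter_arrivals)                     # maximum-likelihood fit
    ks_stat, _ = stats.kstest(inter_arrivals, name, args=params)
    if ks_stat < best_ks:
        best_name, best_ks, best_params = name, ks_stat, params

print("best fit:", best_name, "KS statistic:", best_ks, "parameters:", best_params)
```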
Citations: 66
Cost-efficient hosting and load balancing of Massively Multiplayer Online Games
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5697956
Vlad Nae, R. Prodan, T. Fahringer
Massively Multiplayer Online Games (MMOGs) are a class of computationally intensive client-server applications with severe real-time Quality of Service (QoS) requirements, such as the number of updates per second each client needs to receive from the servers for a fluent and realistic experience. To guarantee the QoS requirements, game providers over-provision a large amount of their resources to game sessions, which is very inefficient and prohibits any but the largest providers from joining the market. In this paper, we present a new approach for cost-efficient hosting of MMOG sessions on Cloud resources, provisioned on demand in the correct amount based on the current number of connected players. Simulation results on real MMOG traces demonstrate that compute Clouds can reduce hosting costs by a factor of two to five. The resource allocation is driven by a load balancing algorithm that distributes the load appropriately so that the QoS requirements are fulfilled at all times. Experimental results on a fast-paced game demonstrator executed on resources owned by a specialised hosting company show that our algorithm is able to adjust the number of game servers and the load distribution to the highly dynamic client load, while maintaining the QoS in 99.34% of the monitored events.
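A heavily simplified version of the demand-driven provisioning the abstract describes, allocating just enough game servers for the current player count plus a safety margin, is sketched below; the per-server capacity and headroom are invented, and the paper's actual load-balancing and QoS model are far more involved.

```python
# Simplified sketch of demand-driven server provisioning for an online game.
# Per-server capacity and headroom are invented numbers; the paper's actual
# load-balancing algorithm and QoS model are not reproduced here.
import math

def servers_needed(connected_players: int,
                   players_per_server: int = 500,
                   headroom: float = 0.2) -> int:
    """Provision enough servers for the current players plus a safety margin."""
    target_capacity = connected_players * (1.0 + headroom)
    return max(1, math.ceil(target_capacity / players_per_server))

# Example: the session grows, then shrinks; provisioning follows the load.
for players in [300, 2400, 9800, 4100]:
    print(players, "players ->", servers_needed(players), "servers")
```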
Citations: 65
Methodology of measurement for energy consumption of applications
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5697987
Georges Da Costa, H. Hlavacs
For IT systems, energy awareness can be improved in two ways: (i) statically or (ii) dynamically. The first way leads to building energy-efficient hardware that runs fast and consumes only a few watts. The second way consists of reacting to instantaneous power consumption and taking decisions that will reduce this consumption.
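A basic building block for either approach is measuring the energy an application consumes, for example by integrating sampled power over its runtime. The sketch below uses a placeholder power probe and an illustrative sampling scheme; a real measurement setup would read an external wattmeter or hardware counters, and this is not the methodology proposed in the paper.

```python
# Hedged sketch: estimate an application's energy by integrating sampled power
# over its runtime. The power probe is a placeholder, not a real sensor API.
import threading
import time

def read_power_watts() -> float:
    """Placeholder for a real power probe (external wattmeter, hardware counters)."""
    return 85.0  # constant value for illustration only

def measure_energy(run_app, sample_interval_s: float = 0.1) -> float:
    """Run the application and return its estimated energy use in joules."""
    worker = threading.Thread(target=run_app)
    energy, last = 0.0, time.monotonic()
    worker.start()
    while worker.is_alive():
        time.sleep(sample_interval_s)
        now = time.monotonic()
        energy += read_power_watts() * (now - last)  # rectangle-rule integration
        last = now
    worker.join()
    return energy

# Roughly 85 W x 0.5 s, so about 42 J (plus sampling granularity).
print(round(measure_energy(lambda: time.sleep(0.5)), 1), "J")
```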
Citations: 58
Impact of virtual machine granularity on cloud computing workloads performance
Pub Date : 2010-10-01 DOI: 10.1109/GRID.2010.5698018
Ping Wang, Wei Huang, Carlos A. Varela
This paper studies the impact of VM granularity on workload performance in cloud computing environments. We use HPL as a representative tightly coupled computational workload, and a web server providing content to customers as a representative loosely coupled, network-intensive workload. The performance evaluation demonstrates that VM granularity has a significant impact on the performance of the computational workload. On an 8-CPU machine, the performance obtained with 8 VMs is more than 4 times higher than that obtained with 4 or 16 VMs for HPL with problem size 4096, whereas on two machines with a total of 12 CPUs, 24 VMs give the best performance for HPL with problem sizes from 256 to 1024. Our results also indicate that the effect of VM granularity on the performance of the web system is not critical: the largest standard deviation of the transaction rates obtained by varying VM granularity is merely 2.89, with a mean value of 21.34. These observations suggest that VM malleability strategies, where VM granularity is changed dynamically, can be used to improve the performance of tightly coupled computational workloads, whereas VM consolidation for energy savings can be applied more effectively to loosely coupled, network-intensive workloads.
Citations: 23