Reliability-aware resource management for computational grid/cluster environments

The 6th IEEE/ACM International Workshop on Grid Computing, 2005. Pub Date: 2005-11-13. DOI: 10.1109/GRID.2005.1542744

K. Limaye, C. Leangsuksun, Yudan Liu, Z. Greenwood, S. Scott, Richard Libby, K. Chanchio


Citations: 13

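Abstract

The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the collaborative community and should be guaranteed. In particular, where job sites are Beowulf cluster systems, the failure of a single service node can cause an outage of the whole system. Current grid fault tolerance techniques address these issues only in an opportunistic fashion. There is therefore a need to complement these approaches by proactively handling failures at the job-site level, ensuring high availability of the system with no loss of user-submitted jobs. Our grid-aware cluster resource management effort was motivated by the fact that clusters have become popular job sites in the computational grid environment. We propose a solution that deals with fault tolerance at the service level, complementing the task-based solutions of some recent studies. We discuss various service availability issues related to the grid, and present preliminary results obtained while implementing a smart failover and transparent job-queue replication mechanism and an automated grid installation package. Our results from the proof-of-concept framework indicate that the benefits outweigh the overhead, which remains acceptable.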
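The abstract names a smart failover and transparent job-queue replication mechanism but gives no implementation details. The following is only a minimal sketch of the general idea, not the authors' design: every submitted job is mirrored to a standby head node, so that losing the primary loses no queued jobs. The names `HeadNode`, `ReplicatedQueue`, `submit`, and `pending` are hypothetical, and the in-memory queues stand in for whatever persistent store a real cluster scheduler would use.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HeadNode:
    """Hypothetical service (head) node holding a copy of the job queue."""
    name: str
    alive: bool = True
    queue: deque = field(default_factory=deque)


class ReplicatedQueue:
    """Illustrative primary/standby pair with a mirrored job queue."""

    def __init__(self, primary: HeadNode, standby: HeadNode) -> None:
        self.primary = primary
        self.standby = standby

    def active(self) -> HeadNode:
        # Failover rule (an assumption): prefer the primary, else the standby.
        if self.primary.alive:
            return self.primary
        if self.standby.alive:
            return self.standby
        raise RuntimeError("no head node available")

    def submit(self, job_id: str) -> str:
        """Queue a job on the active node and mirror it to the live peer."""
        node = self.active()
        node.queue.append(job_id)
        for peer in (self.primary, self.standby):
            if peer is not node and peer.alive:
                peer.queue.append(job_id)  # replication step
        return node.name

    def pending(self) -> list:
        """Jobs visible to users, read from whichever node is active."""
        return list(self.active().queue)


if __name__ == "__main__":
    pair = ReplicatedQueue(HeadNode("primary"), HeadNode("standby"))
    pair.submit("job-1")
    pair.submit("job-2")
    pair.primary.alive = False      # simulate a service node failure
    print(pair.pending())           # jobs survive on the standby
    print(pair.submit("job-3"))     # new submissions go to the standby
```

The sketch deliberately leaves out failure detection and the resynchronization of a recovered primary, both of which a real deployment would need.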


