Uriel Cabello, José Rodríguez, A. Viveros, S. Mendoza, D. Decouchant
{"title":"Fault tolerance in heterogeneous multi-cluster systems through a task migration mechanism","authors":"Uriel Cabello, José Rodríguez, A. Viveros, S. Mendoza, D. Decouchant","doi":"10.1109/ICEEE.2014.6978266","DOIUrl":null,"url":null,"abstract":"The GRID computing paradigm consists of multiple heterogeneous distributed clusters connected by heterogeneous network interfaces. One advantage of this paradigm is to analyze massive amounts of data employing computing resources at different geographic places with different platforms. However in order to harness the power of those resources, many problems must be solved. In this work we deal with the problem of fault tolerance on heterogeneous computer systems. Our proposal aims to ease the process of recovery when system failures are detected at runtime avoiding the necessity for application restarts. Our proposal works through a set of services that performs transparent task migration over the computing nodes, hiding the complexity related with error handling when a hybrid programming model based on Open MPI and OpenCL is employed.","PeriodicalId":6661,"journal":{"name":"2014 11th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE)","volume":"51 1","pages":"1-7"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 11th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICEEE.2014.6978266","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8
Abstract
The GRID computing paradigm consists of multiple heterogeneous distributed clusters connected by heterogeneous network interfaces. One advantage of this paradigm is to analyze massive amounts of data employing computing resources at different geographic places with different platforms. However in order to harness the power of those resources, many problems must be solved. In this work we deal with the problem of fault tolerance on heterogeneous computer systems. Our proposal aims to ease the process of recovery when system failures are detected at runtime avoiding the necessity for application restarts. Our proposal works through a set of services that performs transparent task migration over the computing nodes, hiding the complexity related with error handling when a hybrid programming model based on Open MPI and OpenCL is employed.