Nentawe Gurumdimma, A. Jhumka, Maria Liakata, Edward Chuah, J. Browne
2015 IEEE Trustcom/BigDataSE/ISPA, 2015-08-20. DOI: https://doi.org/10.1109/TRUSTCOM-BIGDATASE-ISPA.2015.613
Towards Increasing the Error Handling Time Window in Large-Scale Distributed Systems Using Console and Resource Usage Logs
Resource-intensive applications, such as scientific applications, require the architecture or system on which they execute to be highly dependable in order to reduce the impact of faults. Typically, the state of the underlying system is captured through messages recorded in a log file, which has proven useful to system administrators in understanding the root causes of system failures (and in subsequent debugging). However, the time window between the first error message appearing in the log file and the ensuing failure may not be large enough for administrators to save the state of the running application, which results in lost execution time. We thus address this fundamental question: Is it possible to extend this time window? The answer is positive: we show that, by using (i) resource usage logs to track anomalous resource usage and (ii) error logs to identify root causes of system failures, it is possible to increase the time window by 50 minutes on average. We achieve this by applying anomaly detection techniques to resource usage data and conducting a root-cause analysis on error log files. The logs were obtained for the Ranger Supercomputer from TACC.
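To make the notion of an extended time window concrete, the following is a minimal sketch, not the authors' method. The abstract does not name the anomaly detection technique used, so a simple z-score test against an initial baseline stands in for it here. The function names, the threshold, and the sample data are all hypothetical. The sketch finds the first anomalous resource usage sample and reports how many minutes earlier it occurs than the first error message.

```python
from statistics import mean, stdev


def first_anomaly_time(samples, baseline_size=5, threshold=3.0):
    """Return the timestamp of the first sample that deviates from the
    baseline mean by more than `threshold` baseline standard deviations.

    `samples` is a list of (minute, value) pairs in time order. The first
    `baseline_size` samples are assumed to represent normal behaviour.
    Returns None if no sample is flagged.
    """
    baseline = [value for _, value in samples[:baseline_size]]
    mu, sigma = mean(baseline), stdev(baseline)
    for timestamp, value in samples[baseline_size:]:
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            return timestamp
    return None


def window_gain(samples, first_error_time, **kwargs):
    """Minutes gained by reacting to the first resource usage anomaly
    rather than to the first error message in the log."""
    anomaly_time = first_anomaly_time(samples, **kwargs)
    if anomaly_time is None or anomaly_time >= first_error_time:
        return 0
    return first_error_time - anomaly_time


# Hypothetical data: (minute, memory usage in MB) for one node.
usage = [(0, 100), (10, 102), (20, 98), (30, 101), (40, 99),
         (50, 103), (60, 180), (70, 260)]
first_error_minute = 90  # hypothetical time of the first error message

gain = window_gain(usage, first_error_minute)
print(f"time window extended by {gain} minutes")
```

In this toy example the anomaly is flagged at minute 60, so reacting to it rather than to the error message at minute 90 leaves 30 extra minutes to save application state. A real deployment would need a detector suited to the resource usage data at hand.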