Fred Lin, A. Davoli, I. Akbar, Sukumar Kalmanje, Leandro Silva, J. Stamford, Yanai S. Golany, Jim Piazza, S. Sankar
Published in: 2020 50th Annual IEEE-IFIP International Conference on Dependable Systems and Networks-Supplemental Volume (DSN-S). Publication date: 2020-06-01. DOI: 10.1109/dsn-s50200.2020.00016 (https://doi.org/10.1109/dsn-s50200.2020.00016)
Predicting Remediations for Hardware Failures in Large-Scale Datacenters
Large-scale service environments rely on autonomous systems to remediate hardware failures efficiently. In production, the autonomous system diagnoses hardware failures using rules that subject matter experts enter into the system. This process grows increasingly complex as new types of failures appear and as hardware and software configurations become more complicated. In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures based on similar repair tickets closed in the past. We explain in detail the methodology for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.
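
The abstract does not describe the model, the features, or the metrics the authors use, so the following is only a toy illustration of the general idea of predicting a remediation from similar closed repair tickets and then checking per-action precision and recall. Everything in it is an assumption made for illustration: the ticket data, the symptom and action names, the Jaccard similarity, the weighted nearest-neighbor vote, and the function names `predict_remediation` and `precision_recall`. It should not be read as the paper's method.

```python
from collections import Counter

# Hypothetical closed repair tickets: (observed symptoms, remediation applied).
# These records are invented for illustration only.
CLOSED_TICKETS = [
    ({"ecc_error", "dimm_a1", "kernel_panic"}, "swap_dimm"),
    ({"ecc_error", "dimm_b2"}, "swap_dimm"),
    ({"smart_fail", "io_timeout"}, "swap_drive"),
    ({"io_timeout", "fs_readonly", "smart_fail"}, "swap_drive"),
    ({"nic_flap", "link_down"}, "reseat_cable"),
]


def jaccard(a, b):
    """Jaccard similarity between two symptom sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_remediation(symptoms, history=CLOSED_TICKETS, k=3):
    """Vote among the k most similar closed tickets, weighted by similarity.

    Returns None when no past ticket shares any symptom, so the failure
    can be left for manual diagnosis instead of guessing.
    """
    ranked = sorted(history, key=lambda t: jaccard(symptoms, t[0]), reverse=True)
    votes = Counter()
    for past_symptoms, action in ranked[:k]:
        votes[action] += jaccard(symptoms, past_symptoms)
    if not votes:
        return None
    action, weight = votes.most_common(1)[0]
    return action if weight > 0 else None


def precision_recall(pairs, action):
    """Precision and recall for one repair action.

    pairs is a list of (predicted_action, actual_action) tuples.
    """
    tp = sum(1 for pred, actual in pairs if pred == action and actual == action)
    fp = sum(1 for pred, actual in pairs if pred == action and actual != action)
    fn = sum(1 for pred, actual in pairs if pred != action and actual == action)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    print(predict_remediation({"ecc_error", "kernel_panic"}))
    print(predict_remediation({"fan_fault"}))
```

Returning `None` when nothing similar exists is a deliberate choice in this sketch: for an automated remediation system, declining to predict is usually cheaper than dispatching the wrong repair. Tracking precision and recall per repair action, rather than a single overall accuracy, keeps rare actions from being hidden by common ones.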