Anomaly Detection Method for Chiller System of Supercomputer
Yuqi Li, Jinghua Feng, Changsong Li
Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference, 2019-06-22
DOI: 10.1145/3341069.3341076 (https://doi.org/10.1145/3341069.3341076)
Citations: 0
Abstract
Supercomputer reliability decreases as system scale increases, so methods that reduce the supercomputer's MTTR (mean time to repair) play a critical role in system management. At present, engineers typically build anomaly detection methods on supercomputer metrics to reduce the MTTR, while the infrastructure data of supercomputers, including chilled water data, are neglected. This paper proposes an ensemble learning method for anomaly detection that combines LSTM (long short-term memory) with a linear regression algorithm. On the basis of this method, we construct an anomaly monitoring system that uses chilled water data. Experimental results show that the method can help engineers precisely detect anomalies.
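The abstract does not say how the two models are combined or how an anomaly is decided, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each model forecasts the next chilled water reading from a sliding window, that the ensemble averages the forecasts, and that a point is flagged when its residual exceeds `k` standard deviations of past residuals. The LSTM is replaced by a plain placeholder forecaster, and all function names, parameters, and data are hypothetical.

```python
from statistics import mean, stdev


def fit_linear(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


def linear_forecast(window):
    """Fit a line over the window's time steps and extrapolate one step ahead."""
    n = len(window)
    a, b = fit_linear(list(range(n)), window)
    return a * n + b


def placeholder_sequence_forecast(window):
    """Stand-in for a trained LSTM forecaster (assumption): mean of the last 3 points."""
    return mean(window[-3:])


def detect_anomalies(series, window=12, k=3.0,
                     forecasters=(linear_forecast, placeholder_sequence_forecast)):
    """Return indices where |actual - ensemble forecast| > k * std of past residuals."""
    residuals, flagged = [], []
    for t in range(window, len(series)):
        history = series[t - window:t]
        # Assumed combination rule: unweighted average of the model forecasts.
        prediction = mean([f(history) for f in forecasters])
        residual = series[t] - prediction
        # Only start flagging once enough residuals exist to estimate their spread.
        if len(residuals) >= window and abs(residual) > k * max(stdev(residuals), 1e-9):
            flagged.append(t)
        else:
            residuals.append(residual)  # flagged points do not update the baseline
    return flagged


# Synthetic chilled water supply temperature (degrees C) with one injected spike.
series = [7.0 + 0.1 * ((i % 5) - 2) for i in range(60)]
series[40] = 12.0
print(detect_anomalies(series))
```

On this synthetic series the injected spike at index 40 is the first point flagged; a few points shortly after it may be flagged as well, because the spike contaminates their forecast windows. A real sequence model would be swapped in by passing a different callable in `forecasters`.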