Paolo Notaro, Qiao Yu, Soroush Haeri, Jorge Cardoso, M. Gerndt
{"title":"An Optical Transceiver Reliability Study based on SFP Monitoring and OS-level Metric Data","authors":"Paolo Notaro, Qiao Yu, Soroush Haeri, Jorge Cardoso, M. Gerndt","doi":"10.1109/CCGrid57682.2023.00011","DOIUrl":null,"url":null,"abstract":"The increasing demand for cloud computing drives the expansion in scale of datacenters and their internal optical network, in a strive for increasing bandwidth, high reliability, and lower latency. Optical transceivers are essential elements of optical networks, whose reliability has not been well-studied compared to other hardware components. In this paper, we leverage high quantities of monitoring data from optical transceivers and OS-level metrics to provide statistical insights about the occurrence of optical transceiver failures. We estimate transceiver failure rates and normal operating ranges for monitored attributes, correlate early-observable patterns to known failure symptoms, and finally develop failure prediction models based on our analyses. Our results enable network administrators to deploy early-warning systems and enact predictive maintenance strategies, such as replacement or traffic re-routing, reducing the number of incidents and their associated costs.","PeriodicalId":363806,"journal":{"name":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid57682.2023.00011","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
The increasing demand for cloud computing drives the expansion in scale of datacenters and their internal optical network, in a strive for increasing bandwidth, high reliability, and lower latency. Optical transceivers are essential elements of optical networks, whose reliability has not been well-studied compared to other hardware components. In this paper, we leverage high quantities of monitoring data from optical transceivers and OS-level metrics to provide statistical insights about the occurrence of optical transceiver failures. We estimate transceiver failure rates and normal operating ranges for monitored attributes, correlate early-observable patterns to known failure symptoms, and finally develop failure prediction models based on our analyses. Our results enable network administrators to deploy early-warning systems and enact predictive maintenance strategies, such as replacement or traffic re-routing, reducing the number of incidents and their associated costs.