Yuhan Zhu, Jian Wang, Bing Li, Yuqi Zhao, Zekun Zhang, Yiming Xiong, Shiping Chen
MicroIRC: Instance-level Root Cause Localization for Microservice Systems
Journal of Systems and Software (impact factor 3.7; JCR Q1, Computer Science, Software Engineering). Published 2024-06-22. DOI: 10.1016/j.jss.2024.112145
https://www.sciencedirect.com/science/article/pii/S0164121224001900
Citations: 0
Abstract
The use of microservice architecture is gaining popularity in the development of web applications. However, identifying the root cause of a failure can be challenging due to the complexity of interconnected microservices, long service invocation links, dynamic changes in service states, and the abundance of service deployment nodes. Furthermore, as each microservice may have multiple instances, it can be difficult to identify instance-level failures promptly and effectively when the microservice topology and failure types change dynamically. To address this issue, we propose MicroIRC (Instance-level Root Cause Localization for Microservice Systems), a novel metrics-based approach that localizes root causes at the instance level while exhibiting robustness to adapt to dynamic changes in topology and new types of anomalies. We begin by training a graph neural network to fit different root cause types based on extracted time series features of microservice system metrics. Next, we construct a heterogeneous weighted topology (HWT) of microservice systems and execute a personalized random walk to identify root cause candidates. These candidates, along with real-time metrics from the anomalous time window, are then fed into the trained graph neural network to generate a ranked root cause list. Experiments conducted on five real-world datasets demonstrate that MicroIRC can accurately locate the root cause of microservices at the instance level, achieving a precision rate of 93.1% for the top five results. Furthermore, compared to the state-of-the-art methods, MicroIRC can improve the accuracy of root cause localization by more than 17% at the service level and more than 11.5% at the instance level. Remarkably, it exhibits robustness in scenarios involving new failure types, achieving an accuracy of 84.2% for the top result amid dynamic topological changes.
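The candidate-selection step mentioned in the abstract, a personalized random walk over a weighted topology, can be illustrated with a minimal sketch. The abstract does not give the walk's transition rule, the edge-weight definition, or the preference vector, so everything below is an assumption: the walk is implemented as standard personalized PageRank with power iteration, and the node names, edge weights, and the `restart` parameter are hypothetical. This is not the authors' implementation, and it omits the graph neural network ranking stage entirely.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: node -> {neighbor: edge weight}.
# Nodes may be services, instances, or hosts (a "heterogeneous" topology).
Graph = Dict[str, Dict[str, float]]


def personalized_random_walk(
    graph: Graph,
    preference: Dict[str, float],
    restart: float = 0.15,
    iterations: int = 100,
) -> List[Tuple[str, float]]:
    """Rank nodes by personalized PageRank, computed with power iteration."""
    nodes = sorted(set(graph) | {n for nbrs in graph.values() for n in nbrs})
    total_pref = sum(preference.get(n, 0.0) for n in nodes)
    if total_pref <= 0:
        raise ValueError("preference must give positive mass to at least one node")
    # Normalized preference vector: where the walker lands after a restart.
    pref = {n: preference.get(n, 0.0) / total_pref for n in nodes}

    score = dict(pref)
    for _ in range(iterations):
        nxt = {n: restart * pref[n] for n in nodes}
        for node in nodes:
            out = graph.get(node, {})
            weight_sum = sum(out.values())
            if weight_sum > 0:
                # Follow outgoing edges in proportion to their weights.
                for nbr, w in out.items():
                    nxt[nbr] += (1 - restart) * score[node] * w / weight_sum
            else:
                # Dangling node: send its mass back to the preference vector.
                for n in nodes:
                    nxt[n] += (1 - restart) * score[node] * pref[n]
        score = nxt
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical topology: edges point from a caller toward its dependencies,
# with made-up weights standing in for metric correlation with the anomaly.
topology: Graph = {
    "frontend": {"cart": 0.8, "catalog": 0.2},
    "cart": {"cart-instance-0": 0.7, "cart-instance-1": 0.3},
    "catalog": {"catalog-instance-0": 1.0},
    "cart-instance-0": {"host-a": 1.0},
    "cart-instance-1": {"host-b": 1.0},
}
# The walk restarts at the service where the anomaly was observed.
candidates = personalized_random_walk(topology, {"frontend": 1.0})
```

In this sketch the highest-scoring nodes in `candidates` would play the role of root cause candidates. In the approach described above, such candidates are then combined with real-time metrics and passed to the trained graph neural network for final ranking.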
About the journal:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.