DyCause: Crowdsourcing to Diagnose Microservice Kernel Failure
Authors: Yicheng Pan, Meng Ma, Xinrui Jiang, Ping Wang
Journal: IEEE Transactions on Dependable and Secure Computing (impact factor 7.0; JCR Q1, Computer Science, Hardware & Architecture)
Publication date: 2023-11-01
Publication type: Journal Article
DOI: https://doi.org/10.1109/tdsc.2022.3233915
Citations: 0
Abstract
Many cloud-hosted web applications (apps) are now built on microservices. However, because anomalies propagate through such systems in a highly dynamic and complex way, troubleshooting them is challenging. Existing diagnostic methods mostly rely on monitoring metrics retrieved from the microservice system kernel. Application owners, and even site reliability engineers (SREs), therefore cannot use those methods effectively when the microservice system lacks such a comprehensive monitoring infrastructure. In this article, we develop DyCause, a crowdsourcing solution to this problem of asymmetric diagnostic information. Our solution collects the operational status of kernel services collaboratively from user space and initiates diagnosis on demand. Because it requires no architectural or functional infrastructure, DyCause is fast and lightweight to deploy in a microservice system. To discover the fine-grained dynamic causalities between services during an anomaly, we also design an efficient algorithm based on statistical analysis. With this algorithm, we can further analyze the anomaly propagation paths within the microservice system and generate a more interpretable diagnosis. We evaluate DyCause in a controlled simulation environment and in a real-world cloud system. The results show that DyCause achieves the best accuracy and efficiency among several state-of-the-art methods and is more robust to parameter choices.
About the journal:
The "IEEE Transactions on Dependable and Secure Computing (TDSC)" is a prestigious journal that publishes high-quality, peer-reviewed research in the field of computer science, specifically targeting the development of dependable and secure computing systems and networks. This journal is dedicated to exploring the fundamental principles, methodologies, and mechanisms that enable the design, modeling, and evaluation of systems that meet the required levels of reliability, security, and performance.
The scope of TDSC includes research on measurement, modeling, and simulation techniques that contribute to the understanding and improvement of system performance under various constraints. It also covers the foundations necessary for the joint evaluation, verification, and design of systems that balance performance, security, and dependability.
By publishing archival research results, TDSC aims to provide a valuable resource for researchers, engineers, and practitioners working in the areas of cybersecurity, fault tolerance, and system reliability. The journal's focus on cutting-edge research ensures that it remains at the forefront of advancements in the field, promoting the development of technologies that are critical for the functioning of modern, complex systems.