Mahsa Raeiszadeh;Amin Ebrahimzadeh;Roch H. Glitho;Johan Eker;Raquel A. F. Mini
{"title":"Asynchronous Real-Time Federated Learning for Anomaly Detection in Microservice Cloud Applications","authors":"Mahsa Raeiszadeh;Amin Ebrahimzadeh;Roch H. Glitho;Johan Eker;Raquel A. F. Mini","doi":"10.1109/TMLCN.2025.3527919","DOIUrl":null,"url":null,"abstract":"The complexity and dynamicity of microservice architectures in cloud environments present substantial challenges to the reliability and availability of the services built on these architectures. Therefore, effective anomaly detection is crucial to prevent impending failures and resolve them promptly. Distributed data analysis techniques based on machine learning (ML) have recently gained attention in detecting anomalies in microservice systems. ML-based anomaly detection techniques mostly require centralized data collection and processing, which may raise scalability and computational issues in practice. In this paper, we propose an Asynchronous Real-Time Federated Learning (ART-FL) approach for anomaly detection in cloud-based microservice systems. In our approach, edge clients perform real-time learning with continuous streaming local data. At the edge clients, we model intra-service behaviors and inter-service dependencies in multi-source distributed data based on a Span Causal Graph (SCG) representation and train a model through a combination of Graph Neural Network (GNN) and Positive and Unlabeled (PU) learning. Our FL approach updates the global model in an asynchronous manner to achieve accurate and efficient anomaly detection, addressing computational overhead across diverse edge clients, including those that experience delays. Our trace-driven evaluations indicate that the proposed method outperforms the state-of-the-art anomaly detection methods by 4% in terms of <inline-formula> <tex-math>$F_{1}$ </tex-math></inline-formula>-score while meeting the given time efficiency and scalability requirements.","PeriodicalId":100641,"journal":{"name":"IEEE Transactions on Machine Learning in Communications and Networking","volume":"3 ","pages":"176-194"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10835399","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Machine Learning in Communications and Networking","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10835399/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
The complexity and dynamicity of microservice architectures in cloud environments present substantial challenges to the reliability and availability of the services built on these architectures. Therefore, effective anomaly detection is crucial to prevent impending failures and resolve them promptly. Distributed data analysis techniques based on machine learning (ML) have recently gained attention in detecting anomalies in microservice systems. ML-based anomaly detection techniques mostly require centralized data collection and processing, which may raise scalability and computational issues in practice. In this paper, we propose an Asynchronous Real-Time Federated Learning (ART-FL) approach for anomaly detection in cloud-based microservice systems. In our approach, edge clients perform real-time learning with continuous streaming local data. At the edge clients, we model intra-service behaviors and inter-service dependencies in multi-source distributed data based on a Span Causal Graph (SCG) representation and train a model through a combination of Graph Neural Network (GNN) and Positive and Unlabeled (PU) learning. Our FL approach updates the global model in an asynchronous manner to achieve accurate and efficient anomaly detection, addressing computational overhead across diverse edge clients, including those that experience delays. Our trace-driven evaluations indicate that the proposed method outperforms the state-of-the-art anomaly detection methods by 4% in terms of $F_{1}$ -score while meeting the given time efficiency and scalability requirements.