Systematic Evaluation of Deep Learning Models for Log-based Failure Prediction

Empirical Software Engineering (IF 3.6; Zone 2, Computer Science; JCR Q1, Computer Science, Software Engineering). Pub Date: 2024-06-20. DOI: 10.1007/s10664-024-10501-4

Fatemeh Hadadi, Joshua H. Dawes, Donghwan Shin, Domenico Bianculli, Lionel Briand



Abstract

With the increasing complexity and scope of software systems, their dependability is crucial. The analysis of log data recorded during system execution can enable engineers to automatically predict failures at run time. Several Machine Learning (ML) techniques, including traditional ML and Deep Learning (DL), have been proposed to automate such tasks. However, current empirical studies are limited in terms of covering all main DL types—Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and transformer—as well as examining them on a wide range of diverse datasets. In this paper, we aim to address these issues by systematically investigating the combination of log data embedding strategies and DL types for failure prediction. To that end, we propose a modular architecture to accommodate various configurations of embedding strategies and DL-based encoders. To further investigate how dataset characteristics such as dataset size and failure percentage affect model accuracy, we synthesised 360 datasets, with varying characteristics, for three distinct system behavioural models, based on a systematic and automated generation approach. Using the F1 score metric, our results show that the best overall performing configuration is a CNN-based encoder with Logkey2vec. Additionally, we provide specific dataset conditions, namely a dataset size \(>350\) or a failure percentage \(>7.5\%\), under which this configuration demonstrates high accuracy for failure prediction.
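As a rough illustration of the two quantitative elements the abstract mentions, the sketch below computes the F1 score for the failure class and checks the reported dataset conditions. It is a minimal sketch and not the authors' evaluation code: the function names, the label encoding (1 = failure, 0 = normal), and the parameter names are assumptions made for illustration, and the unit of dataset size is whatever the paper uses, which the abstract does not state.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive (failure) class.

    Assumed encoding (not from the paper): 1 = failure, 0 = normal.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures correctly predicted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed failures
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined, so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def meets_reported_conditions(dataset_size, failure_percentage):
    """True if either condition from the abstract holds:
    dataset size > 350 or failure percentage > 7.5 (in percent).
    """
    return dataset_size > 350 or failure_percentage > 7.5


# Hypothetical labels for four log sequences, purely for illustration.
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
in_range = meets_reported_conditions(dataset_size=400, failure_percentage=1.0)
```

F1 is the harmonic mean of precision and recall, which makes it a common choice when failures are rare and plain accuracy would be dominated by the normal class.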




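Source journal

Empirical Software Engineering (category: Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 8.50

Self-citation rate: 12.20%

Articles published: 169

Review time: >12 weeks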

About the journal: Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies (processes, methods, or tools) and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.