Exploring the impact of chaos engineering with various user loads on cloud native applications: an exploratory empirical study

IF 3.3 · CAS Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, THEORY & METHODS · Computing · Pub Date: 2024-05-05 · DOI: 10.1007/s00607-024-01292-z

Amro Al-Said Ahmad, Lamis F. Al-Qora’n, Ahmad Zayed


Citations: 0

Abstract

Cloud computing is one of the most popular models for providing computing resources today. Today’s dynamic and successful platforms are built to take advantage of the various resources available from service providers. Ensuring the performance and availability of such resources and services is a crucial problem. Any software system may be subject to faults that can propagate and cause failures. Faults with the potential to contribute to failures are critical because they impair performance and delay responses, which is regarded as a dependability problem. To ensure that critical faults can be discovered as soon as possible, the impact of such faults on the system must be tested. This empirical study examines the performance and dependability of cloud-native systems using fault injection, one of the chaos engineering techniques. It explores the impact of injecting various delay times into two cloud-native applications under different numbers of users. The performance of the applications at each user load is measured in relation to these delays, which in turn reflects the dependability of those systems. Firstly, the systems’ architectures were identified, and two applications that rely on cloud-native services were chosen: a serverless application with two Lambda functions and a containerised microservices application. Secondly, faults were injected in order to quantify performance attributes such as throughput and latency. Several controlled experiments carried out in real-world cloud environments provide exploratory empirical data, which supported the comparisons and statistical analysis that we used to identify the behaviour of the applications under stress. Typical results include an overall reduction in performance, reflected in increased latency as delays are injected. However, a remarkable result is observed at one particular delay, at which defects and availability problems appear unexpectedly. These findings highlight the value of chaos engineering in general, and fault injection in particular, for assessing the dependability of cloud-native applications and for finding unpredicted failures that can arise quickly from defects that are not expected to spread and cause dependability issues.
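The abstract gives no implementation details, so the following is only a minimal local sketch of the general idea it describes: wrap a request handler so that a fixed delay is injected, then measure mean latency and throughput for several delay values and user counts. Everything named here (`echo_handler`, `with_injected_delay`, `run_load`, `sweep`, and the delay and user values) is a hypothetical illustration, with threads standing in for users; it is not the authors' setup, which ran against real cloud services.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def echo_handler(request):
    """Hypothetical stand-in for the application under test."""
    return {"status": 200, "body": request}


def with_injected_delay(handler, delay_s):
    """Wrap a handler so every call first sleeps delay_s seconds (a latency fault)."""
    def wrapped(request):
        time.sleep(delay_s)
        return handler(request)
    return wrapped


def run_load(handler, users, requests_per_user):
    """Simulate concurrent users; return (mean latency in s, throughput in req/s)."""
    def one_user(user_id):
        latencies = []
        for i in range(requests_per_user):
            start = time.perf_counter()
            handler((user_id, i))
            latencies.append(time.perf_counter() - start)
        return latencies

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(one_user, range(users)))
    wall = time.perf_counter() - wall_start

    all_latencies = [lat for user in per_user for lat in user]
    return statistics.mean(all_latencies), len(all_latencies) / wall


def sweep(delays_s, user_counts, requests_per_user=3):
    """Run every (delay, users) combination and collect the measurements."""
    results = {}
    for delay in delays_s:
        faulty = with_injected_delay(echo_handler, delay)
        for users in user_counts:
            results[(delay, users)] = run_load(faulty, users, requests_per_user)
    return results


if __name__ == "__main__":
    # Assumed example values; the abstract does not state the delays or loads used.
    for (delay, users), (latency, throughput) in sweep([0.0, 0.05, 0.1], [1, 5, 10]).items():
        print(f"delay={delay:.2f}s users={users:2d} "
              f"mean_latency={latency:.3f}s throughput={throughput:.1f} req/s")
```

In a sketch like this, mean latency should rise roughly in step with the injected delay. The unexpected availability problems the abstract reports at one particular delay would only show up against a real deployment with timeouts, retries, and resource limits, none of which this local simulation models.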


Source journal

Computing (Engineering & Technology; Computer Science: Theory & Methods)

CiteScore: 8.20
Self-citation rate: 2.70%
Articles published: 107
Review time: 3 months

Journal description: Computing publishes original papers, short communications and surveys on all fields of computing. Contributions should be written in English and may be theoretical or applied in nature; the essential criteria are computational relevance and systematic foundation of results.