Ashish Pandey, P. Calyam, S. Debroy, Songjie Wang, Mauro Lemus Alarcon
The unprecedented growth in edge resources (e.g., scientific instruments, edge servers, sensors) and related data sources has caused a data deluge in scientific application communities. Data processing increasingly relies on machine learning algorithms to cope with the heterogeneity, scale, and velocity of the data. At the same time, there is an abundance of low-cost computation resources that can be used for edge-cloud collaborative computing, viz. "volunteer edge-cloud (VEC) computing". However, a lack of trust in edge resources in terms of performance, agility, cost, and security (PACS) factors is proving to be a barrier to wider adoption of VEC. In this paper, we propose a novel "VECTrust" model to support trusted resource allocation algorithms in VEC computing environments for scientific data-intensive workflows. VECTrust features a two-stage probabilistic model that defines the trust of VEC computing cluster resources by considering trustworthiness in metrics relevant to the PACS factors. We evaluate VECTrust's ability to provide dynamic resource allocation based on PACS factors while enhancing edge-cloud trust in a VEC computing testbed. Further, we show that VECTrust creates a uniform and robust probability distribution of salient PACS-related metrics across diverse bioinformatics workflow executions over batches of workflows.
{"title":"VECTrust","authors":"Ashish Pandey, P. Calyam, S. Debroy, Songjie Wang, Mauro Lemus Alarcon","doi":"10.1145/3468737.3494099","DOIUrl":"https://doi.org/10.1145/3468737.3494099","url":null,"abstract":"The unprecedented growth in edge resources (e.g., scientific instruments, edge servers, sensors) and related data sources has caused a data deluge in scientific application communities. The data processing is increasingly relying on algorithms that utilize machine learning to cope with the heterogeneity, scale, and velocity of the data. At the same time, there is an abundance of low-cost computation resources that can be used for edge-cloud collaborative computing viz., \"volunteer edge-cloud (VEC) computing\". However, lack of trust in terms of performance, agility, cost, and security (PACS) factors in edge resources is proving to be a barrier for wider adoption of VEC. In this paper, we propose a novel \"VECTrust\" model for support of trusted resource allocation algorithms in VEC computing environments for scientific data-intensive workflows. Our VECTrust features a two-stage probabilistic model that defines trust of VEC computing cluster resources by considering trustworthiness in metrics relevant to PACS factors. We evaluate our VECTrust model's ability to provide dynamic resource allocation based on PACS factors, while also enhancing edge-cloud trust in a VEC computing testbed. Further, we show that VECTrust is able to create a uniform and robust probability distribution of salient PACS factor related metrics within diverse bioinformatics workflows execution over batches of workflows.","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125338145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The increasing use of cloud computing for parallel workloads involves, among many problems, resource wastage. When an application does not fully utilize the provisioned resources, the end-of-the-month bill is unnecessarily inflated, mainly due to the user's inexperience and naïve behavior. Many studies have attempted to solve this problem by searching for the optimal VM flavor for specific applications with specific inputs. However, most of these solutions require knowledge about the application or require executing the application on multiple VM flavors. In this work, we propose four new heuristics that recommend cost-effective VMs for parallel workloads based solely on the vCPU-utilization rate of the currently executing VM flavor. We evaluate them in two scenarios and show that the core-heuristic is capable of recommending VM flavors that have minimal impact on performance and reduce application cost, on average, by 1.5x (3.0x) in high (low) vCPU-utilization scenarios.
{"title":"Leveraging vCPU-utilization rates to select cost-efficient VMs for parallel workloads","authors":"William F. C. Tavares, M. M. Assis, E. Borin","doi":"10.1145/3468737.3494095","DOIUrl":"https://doi.org/10.1145/3468737.3494095","url":null,"abstract":"The increasing use of cloud computing for parallel workloads involves, among many problems, resources wastage. When the application does not fully utilize the provisioned resource, the end-of-the-month bill is unnecessarily increased. This is mainly caused by the user's inexperience and naïve behavior. Many studies have attempted to solve this problem by searching for the optimal VM flavor for specific applications with specific inputs. However, most of these solutions require knowledge about the application or require the application's execution on multiple VM flavors. In this work, we propose four new heuristics that recommend cost-effective VMs for parallel workloads based solely on the vCPU-utilization rate of the currently executing VM flavor. We also evaluate them on two scenarios and show that the core-heuristic is capable of recommending VM flavors that have minimal impact on performance and reduce the applications cost, on average, by 1.5x (3.0x) on high (low) vCPU-utilization rate scenarios.","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116982927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lukas Harzenetter, Uwe Breitenbücher, Ghareeb Falazi, F. Leymann, Adrian Wersching
In recent years, many deployment automation technologies have been developed to automatically deploy cloud applications. Most of these technologies employ declarative deployment models that describe the deployment of a cloud application by modeling its components, their configurations, and the relations between them. However, while modeling the deployment of cloud applications declaratively is intuitive, declarative deployment models quickly become complex, as they often contain detailed information about the application's components and their configurations. As a result, considerable technical expertise is typically required to understand the semantics of a declarative deployment model, i.e., what gets deployed and how the components behave. In this paper, we present an approach that automatically detects design patterns in declarative deployment models. This eases understanding the semantics of deployment models, as only the abstract, high-level semantics of the detected patterns must be known instead of technical details about components, relations, and configurations. We demonstrate an open-source implementation based on the Topology and Orchestration Specification for Cloud Applications (TOSCA) and the graphical open-source modeling tool Winery. In addition, we present a detailed case study showing how our approach can be applied in practice using the presented prototype.
{"title":"Automated detection of design patterns in declarative deployment models","authors":"Lukas Harzenetter, Uwe Breitenbücher, Ghareeb Falazi, F. Leymann, Adrian Wersching","doi":"10.1145/3468737.3494085","DOIUrl":"https://doi.org/10.1145/3468737.3494085","url":null,"abstract":"In recent years, many different deployment automation technologies have been developed to automatically deploy cloud applications. Most of these technologies employ declarative deployment models to describe the deployment of a cloud application by modeling its components, their configurations as well as the relations between them. However, while modeling the deployment of cloud applications declaratively is intuitive, declarative deployment models quickly become complex as they often contain detailed information about the application's components and their configurations. As a result, immense technical expertise is typically required to understand the semantics of a declarative deployment model, i. e., what gets deployed and how the components behave. In this paper, we present an approach that automatically detects design patterns in declarative deployment models. This eases understanding the semantics of deployment models as only the abstract and high-level semantics of the detected patterns must be known instead of technical details about components, relations, and configurations. We demonstrate an open-source implementation based on the Topology and Orchestration Specification for Cloud Applications (TOSCA) and the graphical open-source modeling tool Winery. In addition, we present a detailed case study showing how our approach can be applied in practice using the presented prototype.","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133482362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Function-as-a-Service (FaaS) is an emerging model based on serverless cloud computing technology. It builds on the microservice architecture, where developers implement specific functionality and deploy it to a cloud provider, where it executes independently in its own containerised environment. In this paper, we present a software composition approach that orchestrates FaaS offerings from various cloud providers to fulfil the requirements of an application. Our solution integrates a hierarchical planner and a constraint satisfaction solver. Specifically, we discuss the planning method, the constraint satisfaction solver, and the coordination of the selected functions during execution. We also present an experiment in which our approach is tested using functions in the cloud.
{"title":"Multi-cloud serverless function composition","authors":"J. Quenum, Jonas Josua","doi":"10.1145/3468737.3494090","DOIUrl":"https://doi.org/10.1145/3468737.3494090","url":null,"abstract":"Function-as-a-service (FaaS) is an emerging model based on serverless cloud computing technology. It builds on the microservice architecture, where developers implement specific functionality, deploy it to a cloud provider to be executed independently in its own containerised environment. In this paper, we present a software composition approach that orchestrates FaaS from various cloud providers to fulfil the requirements of an application. Our solution integrates a hierarchical planner and a constraint satisfaction solver. Specifically, we discuss the planning method, constraint satisfaction solver, and the coordination of selected functions during the execution. We also present an experiment where our approach is tested using functions in the cloud.","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117304578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data races are notorious concurrency bugs that can cause severe problems, including random crashes and corrupted execution results. However, existing data race detection tools remain difficult to use: it takes significant effort to install, configure, and properly operate a tool, and a single tool often cannot find all the bugs in a program. Requiring users to run multiple tools is often impractical and unproductive because of differences in tool interfaces and report formats. In this paper, we present a cloud-based, service-oriented design and implementation of a race detection service (RDS) to detect data races in parallel programs. RDS integrates multiple data race detection tools into a single cloud-based service via a REST API. It defines a standard JSON format to represent data race detection results, which facilitates producing user-friendly reports, aggregating the output of multiple tools, and processing results with other tools. RDS also defines a set of policies for aggregating the outputs of multiple tools. RDS significantly simplifies the workflow of using data race detection tools and improves report quality and the productivity of performing race detection on parallel programs. Our evaluation shows that RDS delivers more accurate results with much less user effort than the traditional way of using the individual tools. Using four selected tools and DataRaceBench, RDS improves the adjusted F-1 score by 8.8% and 12.6% over the best tool and the average of the tools, respectively. For the NAS Parallel Benchmarks, RDS improves the adjusted accuracy by 35% compared to the average of the tools. Our work demonstrates a new approach to composing software tools for parallel computing via a service-oriented architecture; the same approach and framework can be used to create metaservices for compilers, performance tools, auto-tuning tools, and so on.
{"title":"RDS","authors":"Yaying Shi, Anjia Wang, Yonghong Yan, C. Liao","doi":"10.1145/3468737.3494089","DOIUrl":"https://doi.org/10.1145/3468737.3494089","url":null,"abstract":"Data races are notorious concurrency bugs which can cause severe problems, including random crashes and corrupted execution results. However, existing data race detection tools are still challenging for users to use. It takes a significant amount of effort for users to install, configure and properly use a tool. A single tool often cannot find all the bugs in a program. Requiring users to use multiple tools is often impracticable and not productive because of the differences in tool interfaces and report formats. In this paper, we present a cloud-based, service-oriented design and implementation of a race detection service (RDS)1 to detect data races in parallel programs. RDS integrates multiple data race detection tools into a single cloud-based service via a REST API. It defines a standard JSON format to represent data race detection results, facilitating producing user-friendly reports, aggregating output of multiple tools, as well as being easily processed by other tools. RDS also defines a set of policies for aggregating outputs from multiple tools. RDS significantly simplifies the workflow of using data race detection tools and improves the report quality and productivity of performing race detection for parallel programs. Our evaluation shows that RDS can deliver more accurate results with much less effort from users, when compared with the traditional way of using any individual tools. Using four selected tools and DataRaceBench, RDS improves the Adjusted F-1 scores by 8.8% and 12.6% over the best and the average scores, respectively. For the NAS Parallel Benchmark, RDS improves 35% of the adjusted accuracy compared to the average of the tools. Our work studies a new approach of composing software tools for parallel computing via a service-oriented architecture. The same approach and framework can be used to create metaservice for compilers, performance tools, auto-tuning tools, and so on.","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115840790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fedor Smirnov, Chris Engelhardt, Jakob Mittelberger, Behnaz Pourmohseni, T. Fahringer
This paper provides a first presentation of Apollo, an orchestration framework for serverless function compositions distributed across the cloud-edge continuum. Apollo has a modular design that enables a fine-grained decomposition of the runtime orchestration (scheduling, data transmission, etc.) of applications, so that each of the numerous orchestration decisions can be optimized separately, fully exploiting the potential for the optimization of performance and costs. Apollo features (a) a flexible model of the application and the available resources and (b) a decentralized orchestration scheme carried out by independent agents. This flexible structure enables distributing not only the processing but also the orchestration process itself across a large number of resources, each running an independent Apollo instance. In combination with the ability to execute parts of the application directly on the host of each Apollo instance, this unleashes a significant potential for cost and performance optimization by leveraging data locality. Apollo's efficiency and its potential for application performance improvement are demonstrated in a series of experiments---for both synthetic and real function compositions---where Apollo's capability for flexible distribution of tasks between local containers and serverless functions enables a significant application speedup (up to 20X).
{"title":"Apollo: towards an efficient distributed orchestration of serverless function compositions in the cloud-edge continuum","authors":"Fedor Smirnov, Chris Engelhardt, Jakob Mittelberger, Behnaz Pourmohseni, T. Fahringer","doi":"10.1145/3468737.3494103","DOIUrl":"https://doi.org/10.1145/3468737.3494103","url":null,"abstract":"This paper provides a first presentation of Apollo, an orchestration framework for serverless function compositions distributed across the cloud-edge continuum. Apollo has a modular design that enables a fine-grained decomposition of the runtime orchestration (scheduling, data transmission, etc.) of applications, so that each of the numerous orchestration decisions can be optimized separately, fully exploiting the potential for the optimization of performance and costs. Apollo features (a) a flexible model of the application and the available resources and (b) a decentralized orchestration scheme carried out by independent agents. This flexible structure enables distributing not only the processing but also the orchestration process itself across a large number of resources, each running an independent Apollo instance. In combination with the ability to execute parts of the application directly on the host of each Apollo instance, this unleashes a significant potential for cost and performance optimization by leveraging data locality. Apollo's efficiency and its potential for application performance improvement are demonstrated in a series of experiments---for both synthetic and real function compositions---where Apollo's capability for flexible distribution of tasks between local containers and serverless functions enables a significant application speedup (up to 20X).","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125499029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. H. Mortazavi, Hossein Shafieirad, M. Bahnasy, A. Munir, Yuanhui Cheng, Anudeep Das, Y. Ganjali
Resource optimization algorithms in the cloud are ever more data-driven, and decision-making has become reliant on more and more data flowing from different cloud components. Applications and the network control layer, on the other hand, mainly operate in isolation without direct communication. Recently, tighter integration between the network and applications has been advocated to benefit both, but the information exchange has mostly been limited to flow-level information. We argue that, in the realm of datacenter networks, sharing additional information such as function processing times and deployment data for planning jobs and tasks can yield major optimization benefits for the network. In this study, we present Accord, a network-application integration solution that achieves holistic network-application management. We propose a protocol as an API between the network and the application, and we build a system that uses processing and networking data from the application to perform network scheduling and routing optimizations. We demonstrate that, for a sample distributed learning application, an Accord-enhanced solution that uses application processing information can reduce Job Completion Time (JCT) by up to 27.8%. In addition, we show how Accord can improve routing decisions through a reinforcement learning algorithm that outperforms first shortest path first by 13%.
{"title":"Accord","authors":"S. H. Mortazavi, Hossein Shafieirad, M. Bahnasy, A. Munir, Yuanhui Cheng, Anudeep Das, Y. Ganjali","doi":"10.1145/3468737.3494102","DOIUrl":"https://doi.org/10.1145/3468737.3494102","url":null,"abstract":"Resource optimization algorithms in the cloud are ever more data-driven and decision-making has become reliant on more and more data flowing from different cloud components. Applications and the network control layer on the other hand mainly operate in isolation without direct communication. Recently, increased integration between the network and application has been advocated to benefit both the application and the network but the information exchange has mostly been limited to flow level information. We argue that in the realm of datacenter networks, sharing additional information such as the function processing times and deployment data for planning jobs and tasks can result in major optimization benefits for the network. In this study we present Accord as a Network Application Integration solution to achieve a holistic network-application management solution. We propose a protocol as an API between the network and application then we build a system that uses the processing and networking data from the application to perform network scheduling and routing optimizations. We demonstrate that for a sample distributed learning application, an Accord enhanced solution that uses the application processing information can yield up to 27.8% reduction in Job Completion Time (JCT). In addition, we show how Accord can yield better results for routing decisions through a reinforcement learning algorithm that outperforms first shortest path first by %13.","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115184543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Containers are widely used to process big data in clouds. To prevent information leakage from containers, applications can protect sensitive information using enclaves provided by Intel SGX. The memory of an enclave is encrypted by the CPU using its internal keys. However, the execution of SGX applications cannot continue after the container running them is migrated, because enclave memory cannot be correctly decrypted at the destination host. This paper proposes MigSGX, which enables the continuous execution of SGX applications after container migration. Since the state of an enclave cannot be directly accessed from the outside, MigSGX securely invokes each enclave and makes it dump and load its own state. At dump time, each enclave re-encrypts its state using a CPU-independent key to protect sensitive information. For space and time efficiency, MigSGX saves and restores the large amount of enclave memory in a pipelined manner. We have implemented MigSGX in the Intel SGX SDK and CRIU and show that pipelining improves migration performance by up to 52% while reducing the memory necessary for migration to only 0.15%.
{"title":"MigSGX","authors":"K. Nakashima, Kenichi Kourai","doi":"10.1145/3468737.3494088","DOIUrl":"https://doi.org/10.1145/3468737.3494088","url":null,"abstract":"Recently, containers are widely used to process big data in clouds. To prevent information leakage from containers, applications in containers can protect sensitive information using enclaves provided by Intel SGX. The memory of enclaves is encrypted by a CPU using its internal keys. However, the execution of SGX applications cannot be continued after the container running those applications is migrated. This is because enclave memory cannot be correctly decrypted at the destination host. This paper proposes MigSGX for enabling the continuous execution of SGX applications after container migration. Since the states of enclaves cannot be directly accessed from the outside, MigSGX securely invokes each enclave and makes it dump and load its state. Atthe dump time, each enclave re-encrypts its state using a CPU-independent key to protect sensitive information. For space- and time-efficiency, MigSGX saves and restores a large amount of enclave memory in a pipelined manner. We have implemented MigSGX in the Intel SGX SDK and CRIU and showed that pipelining could improve migration performance by up to 52%. The memory necessary for migration was reduced only to 0.15%.","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126084181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable stream processing systems (SPSs) often require external storage systems for long-term storage of non-ephemeral state. Such state cannot be accommodated in the internal stores of SPSs, which are mainly geared toward fault tolerance of streaming jobs, lack externally visible APIs, and dispose of their state at the end of such jobs. Recent research has pointed to scalable in-memory key-value stores (KVSs) as an efficient solution for managing external state. While such data stores have been interconnected with scalable streaming systems, they are currently managed independently, missing opportunities for optimization, such as exploiting locality between stream partitions and table shards and coordinating elasticity actions. Both processing and data management systems are typically designed for scalability; however, coordination between them poses a significant challenge. In this work, we describe Amoeba, a system that dynamically adapts data-partitioning schemes and/or task or data placement across systems to eliminate unnecessary network communication between nodes. Our evaluation using state-of-the-art systems, such as the Flink SPS and the Redis KVS, demonstrates a 2.6x performance improvement when aligning SPS tasks with KVS shards in AWS deployments of up to 64 nodes.
{"title":"Amoeba: aligning stream processing operators with externally-managed state","authors":"Antonis Papaioannou, K. Magoutis","doi":"10.1145/3468737.3494096","DOIUrl":"https://doi.org/10.1145/3468737.3494096","url":null,"abstract":"Scalable stream processing systems (SPS) often require external storage systems for long-term storage of non-emphemeral state. Such state cannot be accommodated in the internal stores of SPSes that are mainly geared for fault tolerance of streaming jobs, lack externally visible APIs, and their state is disposed of at the end of such jobs. Recent research have pointed to scalable in-memory key-value stores (KVS) as an efficient solution to manage external state. While such data stores have been interconnected with scalable streaming systems, they are currently managed independently, missing opportunities for optimizations, such as exploiting locality between stream partitions and table shards, as well as coordinating elasticity actions. Both processing and data management systems are typically designed for scalability, however coordination between them poses a significant challenge. In this work we describe Amoeba, a system that dynamically adapts data-partitioning schemes and/or task or data placement across systems to eliminate unnecessary network communication across nodes. Our evaluation using state-of-the art systems, such as the Flink SPS and Redis KVS, demonstrated 2.6x performance improvement when aligning SPS tasks with KVS shards in AWS deployments of up to 64 nodes.","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125172884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anshul Jindal, Julian Frielinghaus, Mohak Chadha, M. Gerndt
{"title":"Courier","authors":"Anshul Jindal, Julian Frielinghaus, Mohak Chadha, M. Gerndt","doi":"10.1163/2352-0272_emho_sim_022957","DOIUrl":"https://doi.org/10.1163/2352-0272_emho_sim_022957","url":null,"abstract":"","PeriodicalId":254382,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129296276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}