Low Latency and Resource-Aware Program Composition for Large-Scale Data Analysis
Masahiro Tanaka, K. Taura, Kentaro Torisawa
DOI: https://doi.org/10.1109/CCGrid.2016.88
2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16, 2016

Large-scale data analysis has recently grown in importance across a wide variety of areas, such as natural language processing, sensor data analysis, and scientific computing. Such analysis applications typically reuse existing programs as components and are often required to continuously process new data with low latency while processing large-scale data on distributed computation nodes. However, existing frameworks for combining programs into a parallel data analysis pipeline (e.g., a workflow) suffer from the following issues: (1) most frameworks are oriented toward high-throughput batch processing, which leads to high latency; (2) a specific composition language is often imposed, and/or a specific structure such as a simple unidirectional dataflow among the constituent tasks; and (3) a program used as a component often takes a long time to start up due to heavy initialization, which is referred to as startup overhead. Our solution to these problems is remote procedure call (RPC)-based composition, realized by our middleware Rapid Service Connector (RaSC). RaSC can easily wrap an ordinary program and make it accessible as an RPC service, called a RaSC service. Using component programs as RaSC services lets us integrate them into one program with low latency, without being restricted to a specific workflow language or dataflow structure. In addition, a RaSC service masks the startup overhead of a component program by keeping the component program's processes alive across RPC requests. We also propose an architecture that automatically manages the number of processes to maximize throughput. Experimental results show that our approach excels in overall throughput as well as latency, despite its RPC overhead, and that it can adapt to runtime changes in throughput requirements.
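The keep-alive idea at the heart of a RaSC service can be sketched in a few lines: wrap a line-oriented worker process once, then route each request to its standing stdin/stdout pipes, so the worker's startup cost is paid only on the first call. The class and worker below are hypothetical stand-ins for illustration, not RaSC's actual API.

```python
import subprocess
import sys

class PersistentService:
    """Keep a line-oriented worker process alive across calls, masking its
    startup overhead (hypothetical sketch, not RaSC's actual interface)."""
    def __init__(self, argv):
        # started once; every subsequent call reuses the same process
        self.proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE, text=True)

    def call(self, line):
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()

# Toy worker that upper-cases each request; it stands in for a component
# program with expensive initialization (e.g., loading large NLP models).
WORKER = [sys.executable, "-u", "-c",
          "import sys\n"
          "while True:\n"
          "    line = sys.stdin.readline()\n"
          "    if not line: break\n"
          "    print(line.strip().upper(), flush=True)"]

svc = PersistentService(WORKER)
print(svc.call("hello"))   # HELLO
print(svc.call("world"))   # WORLD -- same process, no second startup
svc.close()
```

A real deployment would put an RPC layer (and a process pool) in front of `call`, but the latency win comes from exactly this reuse of a warm process.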
Dynamic Adaptation of Policies Using Machine Learning
Alejandro Pelaez, Andres Quiroz, M. Parashar
DOI: https://doi.org/10.1109/CCGrid.2016.64

Managing large systems so as to guarantee certain behavior is a difficult problem due to their dynamic behavior and complex interactions. Policies have been shown to provide a very expressive and easy way to define such desired behaviors, mainly because they separate the definition of desired behavior from the enforcement mechanism, allowing either one to be changed fairly easily. Unfortunately, it is often difficult to define policies in terms of attributes that can be measured and/or directly controlled, or to set adaptable (i.e., non-static) parameters that account for rapidly changing system behavior. Dynamic policies address these problems by allowing system administrators to define higher-level parameters, which are more closely related to business goals, while providing an automated mechanism to adapt them at a lower level, where attributes can be measured and/or controlled. Here, we present a way to define such policies, and a machine learning model that dynamically applies lower-level static policies by learning a hidden relationship between the high-level business attribute space and the low-level monitoring space. We show that this relationship exists and that we can learn it, producing an error of at most 8.78% at least 96% of the time.
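The core idea here, learning a mapping from the high-level business attribute space to the low-level monitoring space, can be illustrated with the simplest possible learner: a univariate least-squares fit from a business target to a monitorable knob. The attribute names and numbers below are invented for illustration; the paper's actual model is a more capable learner.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns intercept a and slope b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Observed pairs of (high-level target response time in ms,
# low-level CPU-utilization alarm threshold in %) -- toy data.
history = [(100, 40), (200, 55), (300, 70), (400, 85)]
a, b = fit_line([t for t, _ in history], [c for _, c in history])

def low_level_threshold(target_ms):
    # derive the measurable/controllable setting from the business goal
    return a + b * target_ms

print(low_level_threshold(250))  # ~62.5 on this toy data
```

When the administrator moves the high-level target, the learned map re-derives the static low-level policy parameter instead of requiring it to be hand-tuned.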
Infrastructure Cost Comparison of Running Web Applications in the Cloud Using AWS Lambda and Monolithic and Microservice Architectures
Mario Villamizar, Oscar Garces, Lina Ochoa, Harold E. Castro, Lorena Salamanca, Mauricio Verano, R. Casallas, Santiago Gil, Carlos Valencia, Angee Zambrano, Mery Lang
DOI: https://doi.org/10.1109/CCGrid.2016.37

Large Internet companies like Amazon, Netflix, and LinkedIn are using the microservice architecture pattern to deploy large applications in the cloud as a set of small services that can be developed, tested, deployed, scaled, operated, and upgraded independently. However, aside from gaining agility, independent development, and scalability, infrastructure costs are a major concern for companies adopting this pattern. This paper presents a cost comparison of a web application developed and deployed using the same scalable scenarios with three different approaches: 1) a monolithic architecture, 2) a microservice architecture operated by the cloud customer, and 3) a microservice architecture operated by the cloud provider. Test results show that microservices can help reduce infrastructure costs in comparison to standard monolithic architectures. Moreover, the use of services specifically designed to deploy and scale microservices reduces infrastructure costs by 70% or more. Lastly, we also describe the challenges we faced while implementing and deploying microservice applications.
Creating Soft Heterogeneity in Clusters Through Firmware Re-configuration
Xin Zhan, M. Shoaib, S. Reda
DOI: https://doi.org/10.1109/CCGrid.2016.92

Customizing server hardware to its workload has the potential to improve both runtime and energy efficiency. In a cluster that caters to diverse workloads, however, employing servers with customized hardware components leads to heterogeneity, which is not scalable. In this paper, we seek to create soft heterogeneity from existing servers with homogeneous hardware components by customizing the firmware configuration. We demonstrate that firmware configurations have a large impact on the runtime, power, and energy efficiency of workloads. Since the number of candidate firmware configurations grows exponentially with the number of firmware settings, we propose a methodology called FXplore that completes the exploration with quadratic time complexity. Furthermore, FXplore enables system administrators to manage the degree of heterogeneity by deriving firmware configurations for sub-clusters that cater to multiple workloads with similar characteristics. Thus, during online operation, incoming workloads can be mapped to appropriate sub-clusters with pre-configured firmware settings. FXplore also finds the best firmware settings for co-runners on the same server. We validate our methodology on a fully instrumented cluster under a large range of parallel workloads representative of both high-performance compute clusters and datacenters. Compared to enabling all firmware options, our method reduces average runtime and energy consumption by 11% and 15%, respectively.
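To see why avoiding exhaustive search matters: n binary firmware settings yield 2**n configurations, while a greedy toggle-one-setting-at-a-time search needs at most on the order of n**2 benchmark runs. The sketch below captures only that complexity argument; it is not FXplore's actual algorithm, and the cost function is an invented stand-in for a benchmark run.

```python
def explore(settings, evaluate):
    """Greedy search over binary firmware settings: repeatedly toggle any
    single setting that reduces the measured cost, until no toggle helps.
    Worst case: n passes x n toggles = O(n^2) evaluations, vs. 2**n for
    exhaustive search. (Sketch in the spirit of FXplore, not its method.)"""
    config = dict(settings)
    best = evaluate(config)
    improved = True
    while improved:
        improved = False
        for name in list(config):
            trial = dict(config)
            trial[name] = not trial[name]
            cost = evaluate(trial)
            if cost < best:
                best, config, improved = cost, trial, True
    return config, best

# Toy cost model: runtime is lowest with hyperthreading off and prefetch on.
cost = lambda c: 10 + (2 if c["hyperthreading"] else 0) + (0 if c["prefetch"] else 3)
cfg, best = explore({"hyperthreading": True, "prefetch": False}, cost)
print(cfg, best)  # {'hyperthreading': False, 'prefetch': True} 10
```

Greedy search can miss optima when settings interact, which is one reason a purpose-built exploration methodology is needed rather than this naive loop.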
Distem: Evaluation of Fault Tolerance and Load Balancing Strategies in Real HPC Runtimes through Emulation
Cristian Ruiz, Joseph Emeras, E. Jeanvoine, L. Nussbaum
DOI: https://doi.org/10.1109/CCGrid.2016.35

The era of Exascale computing raises new challenges for HPC. Intrinsic characteristics of those extreme-scale platforms bring energy and reliability issues. To cope with those constraints, applications will have to be more flexible in order to deal with platform geometry evolutions and unavoidable failures. Thus, to prepare for this upcoming era, a strong effort must be made on improving the HPC software stack. This work focuses on improving the study of a central part of the software stack: the HPC runtimes. To this end, we propose a set of extensions to the Distem emulator that enable the evaluation of fault tolerance and load balancing mechanisms in such runtimes. Extensive experimentation showing the benefits of our approach has been performed with three HPC runtimes: Charm++, MPICH, and OpenMPI.
Scheduling In-Situ Analytics in Next-Generation Applications
Oscar H. Mondragon, P. Bridges, Scott Levy, Kurt B. Ferreira, Patrick M. Widener
DOI: https://doi.org/10.1109/CCGrid.2016.42

Next-generation applications increasingly rely on in situ analytics to guide computation, reduce the amount of I/O performed, and perform other important tasks. Scheduling where and when to run analytics is challenging, however. This paper quantifies the costs and benefits of different approaches to scheduling applications and analytics on the nodes of large-scale systems, including space sharing, uncoordinated time sharing, and gang-scheduled time sharing.
Facilitating the Execution of HPC Workloads in Colombia through the Integration of a Private IaaS and a Scientific PaaS/SaaS Marketplace
Harold E. Castro, Mario Villamizar, Oscar Garces, J. Perez, R. Caliz, Pedro F. Perez Arteaga
DOI: https://doi.org/10.1109/CCGrid.2016.52

Many small and medium-sized research groups are limited in executing their HPC workloads by the need to buy, configure, and maintain their own cluster or grid solutions. At the same time, some research groups have large infrastructures with low utilization levels, due in part to the tools they offer to end users, which require each end user to configure complex, distributed environments. In this paper, we present a joint effort between a private and a public institution to offer scientific applications as a service, taking advantage of an existing infrastructure to create a private IaaS using OpenStack and offering scientific applications through a friendly user interface. This strategy lets researchers run their HPC workloads on a private cloud transparently, hiding the complexities of distributed and scalable cloud environments. We show how this strategy may help increase infrastructure utilization, how it allows end users to easily execute and share their applications through a SaaS marketplace, and how new applications can be configured and deployed using a PaaS platform.
Tyrex: Size-Based Resource Allocation in MapReduce Frameworks
Bogdan Ghit, D. Epema
DOI: https://doi.org/10.1109/CCGrid.2016.82

Many large-scale data analytics infrastructures are employed for a wide variety of jobs, ranging from short interactive queries to large data analysis jobs that may take hours or even days to complete. As a consequence, data-processing frameworks like MapReduce may have workloads consisting of jobs with heavy-tailed processing requirements. With such workloads, short jobs may experience slowdowns an order of magnitude larger than large jobs do, while users may expect slowdowns more in proportion to job sizes. To address this problem of large job-slowdown variability in MapReduce frameworks, we design a scheduling system called TYREX, inspired by the well-known TAGS task assignment policy in distributed-server systems. In particular, TYREX partitions the resources of a MapReduce framework, allows any job running in any partition to read data stored on any machine, imposes runtime limits in the partitions, and successively executes parts of jobs in a work-conserving way in these partitions until they run to completion. We develop a statistical model for dynamically setting the runtime limits that achieves near-optimal job slowdown performance, and we empirically evaluate TYREX on a cluster system with workloads consisting of both synthetic and real-world benchmarks. We find that TYREX cuts job slowdown variability in half while preserving the median job slowdown compared to state-of-the-art MapReduce schedulers such as FIFO and FAIR. Furthermore, TYREX reduces job slowdown at the 95th percentile by more than 50% compared to FIFO and by 20-40% compared to FAIR.
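Under TAGS-style policies like the one TYREX adapts, a job's size need not be known up front: every job starts in the first partition, and only jobs that exhaust that partition's runtime limit move on to the next, so short jobs never queue behind long ones. A minimal sketch of that policy follows; the job sizes and limits are toy numbers, and TYREX itself derives the limits from a statistical model.

```python
def tags_schedule(jobs, limits):
    """TAGS-style successive execution: each job runs in partition i for at
    most limits[i] seconds of work; unfinished jobs continue in the next
    partition (a limit of None means unbounded, i.e., run to completion).
    Returns, per job, the list of partitions it visited."""
    placement = {}
    for job, size in jobs.items():
        visited, done = [], 0.0
        for i, limit in enumerate(limits):
            visited.append(i)
            done += limit if limit is not None else size - done
            if done >= size:
                break
        placement[job] = visited
    return placement

# Two partitions: a 60 s limit for the first, unbounded for the second.
print(tags_schedule({"query": 10, "batch": 3600}, [60, None]))
# {'query': [0], 'batch': [0, 1]}
```

The work-conserving detail in TYREX (partial results carry over rather than being redone) is what distinguishes it from classic TAGS, which restarts migrated jobs from scratch.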
Landrush: Rethinking In-Situ Analysis for GPGPU Workflows
Anshuman Goswami, Yuan Tian, K. Schwan, F. Zheng, Jeffrey S. Young, M. Wolf, G. Eisenhauer, S. Klasky
DOI: https://doi.org/10.1109/CCGrid.2016.58

In-situ analysis of the output data of scientific simulations has been made necessary by ever-growing output data volumes and the increasing cost of data movement as supercomputing moves toward exascale. With hardware accelerators like GPUs becoming increasingly common in high-end machines, new opportunities arise to co-locate scientific simulations and online analysis of the data they generate. However, the asynchronous nature of GPGPU programming models and the limited context-switching capabilities of the GPU pose challenges to co-locating the simulation and analysis on the same GPU. This paper dives deeper into these challenges to understand how best to co-locate analysis with scientific simulations on the GPUs in HPC clusters. Specifically, our 'Landrush' approach to GPU sharing utilizes idle cycles on the GPU to improve time-to-answer, that is, the total time to run the scientific simulation and the analysis of the generated data. Landrush is demonstrated with experimental results from leadership high-end applications on ORNL's Titan supercomputer, which show that (i) GPU-based scientific simulations have varying degrees of idle cycles that afford useful analysis-task co-location, and (ii) the inability to context switch on the GPU at instruction granularity can be overcome by careful control of the analysis kernel launches and software-controlled early completion of analysis kernel executions. Results show that Landrush delivers better time-to-answer than either serially running simulations followed by analysis or relying on the GPU driver and hardwired thread dispatcher to run analysis concurrently on a single GPU.
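The idle-cycle opportunity Landrush exploits can be modeled with a toy timeline: size analysis kernel launches so they fit inside the gaps between simulation kernels instead of delaying them. Everything below (phase durations, chunk granularity, the timeline model itself) is invented for illustration and is far simpler than real GPU scheduling.

```python
def colocate(sim_phases, analysis_chunks, chunk_time):
    """Fill GPU idle gaps between simulation kernels with small analysis
    kernel launches (toy model of the idea: launches are sized to fit the
    gap, so the next simulation kernel is never delayed).

    sim_phases: list of (sim_kernel_time, idle_gap) pairs, in seconds.
    Returns (chunks completed, [(time of gap start, chunks launched), ...]).
    """
    t, done, timeline = 0.0, 0, []
    for busy, idle in sim_phases:
        t += busy                          # simulation kernel runs
        fit = int(idle // chunk_time)      # analysis chunks that fit the gap
        launched = min(fit, analysis_chunks - done)
        done += launched
        timeline.append((t, launched))
        t += idle                          # gap elapses regardless
    return done, timeline

done, tl = colocate([(5, 2), (5, 3)], analysis_chunks=4, chunk_time=1.0)
print(done)  # 4 -- all analysis finished inside idle gaps
```

In this model the analysis adds zero time-to-answer whenever it fits in the gaps, which is the best case the paper's measured idle cycles make plausible; the hard part Landrush actually solves is enforcing this on hardware that cannot preempt kernels at instruction granularity.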
{"title":"CVSS: A Cost-Efficient and QoS-Aware Video Streaming Using Cloud Services","authors":"Xiangbo Li, M. Salehi, M. Bayoumi, R. Buyya","doi":"10.1109/CCGrid.2016.49","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.49","url":null,"abstract":"Video streams, either in form of on-demand streaming or live streaming, usually have to be converted (i.e., transcoded) based on the characteristics of clients' devices (e.g., spatial resolution, network bandwidth, and supported formats). Transcoding is a computationally expensive and time-consuming operation; therefore, streaming service providers currently store numerous transcoded versions of the same video to serve different types of client devices. Due to the expense of maintaining and upgrading storage and computing infrastructures, many streaming service providers (e.g., Netflix) have recently become reliant on cloud services. However, the challenge in utilizing cloud services for video transcoding is how to deploy cloud resources in a cost-efficient manner without any major impact on the quality of video streams. To address this challenge, in this paper, we present the Cloud-based Video Streaming Service (CVSS) architecture to transcode video streams in an on-demand manner. The architecture provides a platform for streaming service providers to utilize cloud resources in a cost-efficient manner and with respect to the Quality of Service (QoS) demands of video streams. In particular, the architecture includes a QoS-aware scheduling method to efficiently map video streams to cloud resources, and a cost-aware dynamic (i.e., elastic) resource provisioning policy that adapts the resource acquisition with respect to the video streaming QoS demands. 
Simulation results based on realistic cloud traces and with various workload conditions demonstrate that the CVSS architecture can satisfy video streaming QoS demands and reduce the incurred cost for stream providers by up to 70%.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114931376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
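The CVSS abstract names two cooperating policies: a QoS-aware scheduler that maps transcoding tasks to cloud resources, and a cost-aware elastic provisioning policy. The sketch below illustrates one plausible reading of each, not the paper's actual algorithms; the earliest-completion-time heuristic, the thresholds, and all function names are illustrative assumptions.

```python
# Hypothetical sketch of the two CVSS-style policies:
# (1) QoS-aware scheduling: assign each transcoding task to the VM that
#     yields the earliest completion time, tracking deadline misses;
# (2) cost-aware elastic provisioning: acquire a VM when the deadline miss
#     rate crosses a threshold, release one when utilization is low.


def qos_schedule(tasks, vm_free_at):
    """tasks: list of (duration, deadline); vm_free_at: per-VM ready times.

    Returns (assignments, misses): assignments[i] is the VM index chosen
    for task i; misses counts tasks finishing past their deadline.
    """
    assignments, misses = [], 0
    for duration, deadline in tasks:
        # Earliest-completion-time heuristic over the current VM queues.
        vm = min(range(len(vm_free_at)), key=lambda v: vm_free_at[v] + duration)
        finish = vm_free_at[vm] + duration
        vm_free_at[vm] = finish
        assignments.append(vm)
        if finish > deadline:
            misses += 1
    return assignments, misses


def provision(n_vms, miss_rate, utilization,
              miss_threshold=0.1, low_util=0.3, min_vms=1):
    """Elastic policy: scale out under QoS pressure, scale in when idle."""
    if miss_rate > miss_threshold:
        return n_vms + 1      # acquire a VM to protect streaming QoS
    if utilization < low_util and n_vms > min_vms:
        return n_vms - 1      # release a VM to cut cost
    return n_vms
```

The split mirrors the architecture's division of labor: the scheduler optimizes within the current VM pool, while the provisioning policy resizes the pool whenever QoS or cost drifts out of bounds.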