Hardware Lock Elision (HLE) represents a promising technique to enhance parallelism of concurrent applications relying on conventional, lock-based synchronization. The idea at the basis of current HLE approaches is to wrap critical sections into hardware transactions: this allows critical sections to be executed in parallel using a speculative approach, while leveraging on conflict detection capabilities provided by hardware transactions to ensure equivalent semantics to pessimistic lock-based synchronization. In this paper we present RW-LE, the first HLE approach targeting read-write locks. RW-LE introduces an innovative hardware-software co-design that exploits two recent micro-architectural features of POWER8 processors: suspending/resuming transaction execution and rollback-only transactions. RW-LE's original design provides two major benefits with respect to existing HLE techniques: i) eliding the read lock without resorting to the use of hardware transactions, and ii) avoiding to track read memory accesses issued in the write critical section. We evaluate RW-LE by means of an extensive experimental study based on a variety of benchmarks and real-life, complex applications. Our results demonstrate that RW-LE can provide striking performance gain of up to one order of magnitude with respect to state of the art HLE approaches.
{"title":"Hardware read-write lock elision","authors":"P. Felber, S. Issa, A. Matveev, P. Romano","doi":"10.1145/2901318.2901346","DOIUrl":"https://doi.org/10.1145/2901318.2901346","url":null,"abstract":"Hardware Lock Elision (HLE) represents a promising technique to enhance parallelism of concurrent applications relying on conventional, lock-based synchronization. The idea at the basis of current HLE approaches is to wrap critical sections into hardware transactions: this allows critical sections to be executed in parallel using a speculative approach, while leveraging on conflict detection capabilities provided by hardware transactions to ensure equivalent semantics to pessimistic lock-based synchronization. In this paper we present RW-LE, the first HLE approach targeting read-write locks. RW-LE introduces an innovative hardware-software co-design that exploits two recent micro-architectural features of POWER8 processors: suspending/resuming transaction execution and rollback-only transactions. RW-LE's original design provides two major benefits with respect to existing HLE techniques: i) eliding the read lock without resorting to the use of hardware transactions, and ii) avoiding to track read memory accesses issued in the write critical section. We evaluate RW-LE by means of an extensive experimental study based on a variety of benchmarks and real-life, complex applications. Our results demonstrate that RW-LE can provide striking performance gain of up to one order of magnitude with respect to state of the art HLE approaches.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78938106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Subramanya R. Dulloor, Amitabha Roy, Zheguang Zhao, N. Sundaram, N. Satish, R. Sankaran, Jeffrey R. Jackson, K. Schwan
Memory-based data center applications require increasingly large memory capacities, but face the challenges posed by the inherent difficulties in scaling DRAM and also the cost of DRAM. Future systems are attempting to address these demands with heterogeneous memory architectures coupling DRAM with high capacity, low cost, but also lower performance, non-volatile memories (NVM) such as PCM and RRAM. A key usage model intended for NVM is as cheaper high capacity volatile memory. Data center operators are bound to ask whether this model for the usage of NVM to replace the majority of DRAM memory leads to a large slowdown in their applications? It is crucial to answer this question because a large performance impact will be an impediment to the adoption of such systems. This paper presents a thorough study of representative applications -- including a key-value store (MemC3), an in-memory database (VoltDB), and a graph analytics framework (GraphMat) -- on a platform that is capable of emulating a mix of memory technologies. Our conclusions are that it is indeed possible to use a mix of a small amount of fast DRAM and large amounts of slower NVM without a proportional impact to an application's performance. The caveat is that this result can only be achieved through careful placement of data structures. The contribution of this paper is the design and implementation of a set of libraries and automatic tools that enables programmers to achieve optimal data placement with minimal effort on their part. With such guided placement and with DRAM constituting only 6% of the total memory footprint for GraphMat and 25% for VoltDB and MemC3 (remaining memory is NVM with 4x higher latency and 8x lower bandwidth than DRAM), we show that our target applications demonstrate only a 13% to 40% slowdown. Without guided placement, these applications see, in the worst case, 1.5x to 5.9x slowdown on the same configuration. Based on a realistic assumption that NVM will be 5x cheaper (per bit) than DRAM, this hybrid solution also results in 2x to 2.8x better performance/$ than a DRAM-only system.
{"title":"Data tiering in heterogeneous memory systems","authors":"Subramanya R. Dulloor, Amitabha Roy, Zheguang Zhao, N. Sundaram, N. Satish, R. Sankaran, Jeffrey R. Jackson, K. Schwan","doi":"10.1145/2901318.2901344","DOIUrl":"https://doi.org/10.1145/2901318.2901344","url":null,"abstract":"Memory-based data center applications require increasingly large memory capacities, but face the challenges posed by the inherent difficulties in scaling DRAM and also the cost of DRAM. Future systems are attempting to address these demands with heterogeneous memory architectures coupling DRAM with high capacity, low cost, but also lower performance, non-volatile memories (NVM) such as PCM and RRAM. A key usage model intended for NVM is as cheaper high capacity volatile memory. Data center operators are bound to ask whether this model for the usage of NVM to replace the majority of DRAM memory leads to a large slowdown in their applications? It is crucial to answer this question because a large performance impact will be an impediment to the adoption of such systems. This paper presents a thorough study of representative applications -- including a key-value store (MemC3), an in-memory database (VoltDB), and a graph analytics framework (GraphMat) -- on a platform that is capable of emulating a mix of memory technologies. Our conclusions are that it is indeed possible to use a mix of a small amount of fast DRAM and large amounts of slower NVM without a proportional impact to an application's performance. The caveat is that this result can only be achieved through careful placement of data structures. The contribution of this paper is the design and implementation of a set of libraries and automatic tools that enables programmers to achieve optimal data placement with minimal effort on their part. With such guided placement and with DRAM constituting only 6% of the total memory footprint for GraphMat and 25% for VoltDB and MemC3 (remaining memory is NVM with 4x higher latency and 8x lower bandwidth than DRAM), we show that our target applications demonstrate only a 13% to 40% slowdown. Without guided placement, these applications see, in the worst case, 1.5x to 5.9x slowdown on the same configuration. Based on a realistic assumption that NVM will be 5x cheaper (per bit) than DRAM, this hybrid solution also results in 2x to 2.8x better performance/$ than a DRAM-only system.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76422227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given the high cost of large-scale data centers, an important design goal is to fully utilize available power resources to maximize the computing capacity. In this paper we present Ampere, a novel power management system for data centers to increase the computing capacity by over-provisioning the number of servers. Instead of doing power capping that degrades the performance of running jobs, we use a statistical control approach to implement dynamic power management by indirectly affecting the workload scheduling, which can enormously reduce the risk of power violations. Instead of being a part of the already over-complicated scheduler, Ampere only interacts with the scheduler with two basic APIs. Instead of power control on the rack level, we impose power constraint on the row level, which leads to more room for over provisioning. We have implemented and deployed Ampere in our production data center. Controlled experiments on 400+ servers show that by adding 17% servers, we can increase the throughput of the data center by 15%, leading to significant cost savings while bringing no disturbances to the job performance.
{"title":"Increasing large-scale data center capacity by statistical power control","authors":"Guosai Wang, Shuhao Wang, Bing Luo, Weisong Shi, Yinghan Zhu, Wenjun Yang, Dianming Hu, Longbo Huang, Xin Jin, W. Xu","doi":"10.1145/2901318.2901338","DOIUrl":"https://doi.org/10.1145/2901318.2901338","url":null,"abstract":"Given the high cost of large-scale data centers, an important design goal is to fully utilize available power resources to maximize the computing capacity. In this paper we present Ampere, a novel power management system for data centers to increase the computing capacity by over-provisioning the number of servers. Instead of doing power capping that degrades the performance of running jobs, we use a statistical control approach to implement dynamic power management by indirectly affecting the workload scheduling, which can enormously reduce the risk of power violations. Instead of being a part of the already over-complicated scheduler, Ampere only interacts with the scheduler with two basic APIs. Instead of power control on the rack level, we impose power constraint on the row level, which leads to more room for over provisioning. We have implemented and deployed Ampere in our production data center. Controlled experiments on 400+ servers show that by adding 17% servers, we can increase the throughput of the data center by 15%, leading to significant cost savings while bringing no disturbances to the job performance.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85053115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeff Rasley, Konstantinos Karanasos, Srikanth Kandula, Rodrigo Fonseca, M. Vojnović, Sriram Rao
Job scheduling in Big Data clusters is crucial both for cluster operators' return on investment and for overall user experience. In this context, we observe several anomalies in how modern cluster schedulers manage queues, and argue that maintaining queues of tasks at worker nodes has significant benefits. On one hand, centralized approaches do not use worker-side queues. Given the inherent feedback delays that these systems incur, they achieve suboptimal cluster utilization, particularly for workloads dominated by short tasks. On the other hand, distributed schedulers typically do employ worker-side queuing, and achieve higher cluster utilization. However, they fail to place tasks at the best possible machine, since they lack cluster-wide information, leading to worse job completion time, especially for heterogeneous workloads. To the best of our knowledge, this is the first work to provide principled solutions to the above problems by introducing queue management techniques, such as appropriate queue sizing, prioritization of task execution via queue reordering, starvation freedom, and careful placement of tasks to queues. We instantiate our techniques by extending both a centralized (YARN) and a distributed (Mercury) scheduler, and evaluate their performance on a wide variety of synthetic and production workloads derived from Microsoft clusters. Our centralized implementation, Yaq-c, achieves 1.7x improvement on median job completion time compared to YARN, and our distributed one, Yaq-d, achieves 9.3x improvement over an implementation of Sparrow's batch sampling on Mercury.
{"title":"Efficient queue management for cluster scheduling","authors":"Jeff Rasley, Konstantinos Karanasos, Srikanth Kandula, Rodrigo Fonseca, M. Vojnović, Sriram Rao","doi":"10.1145/2901318.2901354","DOIUrl":"https://doi.org/10.1145/2901318.2901354","url":null,"abstract":"Job scheduling in Big Data clusters is crucial both for cluster operators' return on investment and for overall user experience. In this context, we observe several anomalies in how modern cluster schedulers manage queues, and argue that maintaining queues of tasks at worker nodes has significant benefits. On one hand, centralized approaches do not use worker-side queues. Given the inherent feedback delays that these systems incur, they achieve suboptimal cluster utilization, particularly for workloads dominated by short tasks. On the other hand, distributed schedulers typically do employ worker-side queuing, and achieve higher cluster utilization. However, they fail to place tasks at the best possible machine, since they lack cluster-wide information, leading to worse job completion time, especially for heterogeneous workloads. To the best of our knowledge, this is the first work to provide principled solutions to the above problems by introducing queue management techniques, such as appropriate queue sizing, prioritization of task execution via queue reordering, starvation freedom, and careful placement of tasks to queues. We instantiate our techniques by extending both a centralized (YARN) and a distributed (Mercury) scheduler, and evaluate their performance on a wide variety of synthetic and production workloads derived from Microsoft clusters. Our centralized implementation, Yaq-c, achieves 1.7x improvement on median job completion time compared to YARN, and our distributed one, Yaq-d, achieves 9.3x improvement over an implementation of Sparrow's batch sampling on Mercury.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82433209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present Crayon, a library and runtime system that reduces display power dissipation by acceptably approximating displayed images via shape and color transforms. Crayon can be inserted between an application and the display to optimize dynamically generated images before they appear on the screen. It can also be applied offline to optimize stored images before they are retrieved and displayed. Crayon exploits three fundamental properties: the acceptability of small changes in shape and color, the fact that the power dissipation of OLED displays and DLP pico-projectors is different for different colors, and the relatively small energy cost of computation in comparison to display energy usage. We implement and evaluate Crayon in three contexts: a hardware platform with detailed power measurement facilities and an OLED display, an Android tablet, and a set of cross-platform tools. Our results show that Crayon's color transforms can reduce display power dissipation by over 66% while producing images that remain visually acceptable to users. The measured whole-system power reduction is approximately 50%. We quantify the acceptability of Crayon's shape and color transforms with a user study involving over 400 participants and over 21,000 image evaluations.
{"title":"Crayon: saving power through shape and color approximation on next-generation displays","authors":"Phillip Stanley-Marbell, V. Estellers, M. Rinard","doi":"10.1145/2901318.2901347","DOIUrl":"https://doi.org/10.1145/2901318.2901347","url":null,"abstract":"We present Crayon, a library and runtime system that reduces display power dissipation by acceptably approximating displayed images via shape and color transforms. Crayon can be inserted between an application and the display to optimize dynamically generated images before they appear on the screen. It can also be applied offline to optimize stored images before they are retrieved and displayed. Crayon exploits three fundamental properties: the acceptability of small changes in shape and color, the fact that the power dissipation of OLED displays and DLP pico-projectors is different for different colors, and the relatively small energy cost of computation in comparison to display energy usage. We implement and evaluate Crayon in three contexts: a hardware platform with detailed power measurement facilities and an OLED display, an Android tablet, and a set of cross-platform tools. Our results show that Crayon's color transforms can reduce display power dissipation by over 66% while producing images that remain visually acceptable to users. The measured whole-system power reduction is approximately 50%. We quantify the acceptability of Crayon's shape and color transforms with a user study involving over 400 participants and over 21,000 image evaluations.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89291085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Next-generation byte-addressable nonvolatile memories (NVMs), such as phase change memory (PCM) and Memristors, promise fast data storage, and more importantly, address DRAM scalability issues. State-of-the-art OS mechanisms for NVMs have focused on improving the block-based virtual file system (VFS) to manage both persistence and the memory capacity scaling needs of applications. However, using the VFS for capacity scaling has several limitations, such as the lack of automatic memory capacity scaling across DRAM and NVM, inefficient use of the processor cache and TLB, and high page access costs. These limitations reduce application performance and also impact applications that use NVM for persistent object storage with flat namespaces, such as photo stores, NoSQL databases, and others. To address such limitations, we propose persistent virtual memory (pVM), a system software abstraction that provides applications with (1) automatic OS-level memory capacity scaling, (2) flexible memory placement policies across NVM, and (3) fast object storage. pVM extends the OS virtual memory (VM) instead of building on the VFS and abstracts NVM as a NUMA node with support for NVM-based memory placement mechanisms. pVM inherits benefits from the cache and TLB-efficient VM subsystem and augments these further by distinguishing between persistent and nonpersistent capacity use of NVM. Additionally, pVM achieves fast persistent storage by further extending the VM subsystem with consistent and durable OS-level persistent metadata. Our evaluation of pVM with memory capacity-intensive applications shows a 2.5x speedup and up to 80% lower TLB and cache misses compared to VFS-based systems. pVM's object store provides 2x higher throughput compared to the block-based approach of the state-of-the art solution and up to a 4x reduction in the time spent in the OS.
{"title":"pVM: persistent virtual memory for efficient capacity scaling and object storage","authors":"Sudarsun Kannan, Ada Gavrilovska, K. Schwan","doi":"10.1145/2901318.2901325","DOIUrl":"https://doi.org/10.1145/2901318.2901325","url":null,"abstract":"Next-generation byte-addressable nonvolatile memories (NVMs), such as phase change memory (PCM) and Memristors, promise fast data storage, and more importantly, address DRAM scalability issues. State-of-the-art OS mechanisms for NVMs have focused on improving the block-based virtual file system (VFS) to manage both persistence and the memory capacity scaling needs of applications. However, using the VFS for capacity scaling has several limitations, such as the lack of automatic memory capacity scaling across DRAM and NVM, inefficient use of the processor cache and TLB, and high page access costs. These limitations reduce application performance and also impact applications that use NVM for persistent object storage with flat namespaces, such as photo stores, NoSQL databases, and others. To address such limitations, we propose persistent virtual memory (pVM), a system software abstraction that provides applications with (1) automatic OS-level memory capacity scaling, (2) flexible memory placement policies across NVM, and (3) fast object storage. pVM extends the OS virtual memory (VM) instead of building on the VFS and abstracts NVM as a NUMA node with support for NVM-based memory placement mechanisms. pVM inherits benefits from the cache and TLB-efficient VM subsystem and augments these further by distinguishing between persistent and nonpersistent capacity use of NVM. Additionally, pVM achieves fast persistent storage by further extending the VM subsystem with consistent and durable OS-level persistent metadata. Our evaluation of pVM with memory capacity-intensive applications shows a 2.5x speedup and up to 80% lower TLB and cache misses compared to VFS-based systems. pVM's object store provides 2x higher throughput compared to the block-based approach of the state-of-the art solution and up to a 4x reduction in the time spent in the OS.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83773900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud data centers operate at very low utilization rates resulting in significant energy waste. Oasis is a new approach for energy-oriented cluster management that enables dense server consolidation. Oasis achieves high consolidation ratios by combining traditional full VM migration with partial VM migration. Partial VM migration is used to densely consolidate the working sets of idle VMs by migrating on-demand only the pages that are accessed by the idle VMs to a consolidation host. Full VM migration is used to dynamically adapt the placement of VMs so that hosts are free from active VMs. Oasis sizes the cluster and saves energy by placing hosts without active VMs into sleep mode. It uses a low-power memory server design to allow the sleeping hosts to continue to service memory requests. In a simulated VDI server farm, our prototype saves energy by up to 28% on weekdays and 43% on weekends with minimal impact on the user productivity.
{"title":"Oasis: energy proportionality with hybrid server consolidation","authors":"Junji Zhi, Nilton Bila, E. D. Lara","doi":"10.1145/2901318.2901333","DOIUrl":"https://doi.org/10.1145/2901318.2901333","url":null,"abstract":"Cloud data centers operate at very low utilization rates resulting in significant energy waste. Oasis is a new approach for energy-oriented cluster management that enables dense server consolidation. Oasis achieves high consolidation ratios by combining traditional full VM migration with partial VM migration. Partial VM migration is used to densely consolidate the working sets of idle VMs by migrating on-demand only the pages that are accessed by the idle VMs to a consolidation host. Full VM migration is used to dynamically adapt the placement of VMs so that hosts are free from active VMs. Oasis sizes the cluster and saves energy by placing hosts without active VMs into sleep mode. It uses a low-power memory server design to allow the sleeping hosts to continue to service memory requests. In a simulated VDI server farm, our prototype saves energy by up to 28% on weekdays and 43% on weekends with minimal impact on the user productivity.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84731943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prateek Sharma, Tian Guo, Xin He, David E. Irwin, P. Shenoy
Cloud providers now offer transient servers, which they may revoke at anytime, for significantly lower prices than on-demand servers, which they cannot revoke. The low price of transient servers is particularly attractive for executing an emerging class of workload, which we call Batch-Interactive Data-Intensive (BIDI), that is becoming increasingly important for data analytics. BIDI workloads require large sets of servers to cache massive datasets in memory to enable low latency operation. In this paper, we illustrate the challenges of executing BIDI workloads on transient servers, where revocations (akin to failures) are the common case. To address these challenges, we design Flint, which is based on Spark and includes automated checkpointing and server selection policies that i) support batch and interactive applications and ii) dynamically adapt to application characteristics. We evaluate a prototype of Flint using EC2 spot instances, and show that it yields cost savings of up to 90% compared to using on-demand servers, while increasing running time by < 2%.
{"title":"Flint: batch-interactive data-intensive processing on transient servers","authors":"Prateek Sharma, Tian Guo, Xin He, David E. Irwin, P. Shenoy","doi":"10.1145/2901318.2901319","DOIUrl":"https://doi.org/10.1145/2901318.2901319","url":null,"abstract":"Cloud providers now offer transient servers, which they may revoke at anytime, for significantly lower prices than on-demand servers, which they cannot revoke. The low price of transient servers is particularly attractive for executing an emerging class of workload, which we call Batch-Interactive Data-Intensive (BIDI), that is becoming increasingly important for data analytics. BIDI workloads require large sets of servers to cache massive datasets in memory to enable low latency operation. In this paper, we illustrate the challenges of executing BIDI workloads on transient servers, where revocations (akin to failures) are the common case. To address these challenges, we design Flint, which is based on Spark and includes automated checkpointing and server selection policies that i) support batch and interactive applications and ii) dynamically adapt to application characteristics. We evaluate a prototype of Flint using EC2 spot instances, and show that it yields cost savings of up to 90% compared to using on-demand servers, while increasing running time by < 2%.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"139 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86266347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As systems continue to increase the number of cores within cache coherency domains, traditional techniques for enabling parallel computation on data-structures are increasingly strained. A single contended cache-line bouncing between different caches can prohibit continued performance gains with additional cores. New abstractions and mechanisms are required to reassess how data-structure consistency can be provided, while maintaining stable per-core access latencies. This paper presents the Parallel Sections (ParSec) abstraction for mediating access to shared data-structures. Fundamental to the approach is a new form of scalable memory reclamation that leverages fast local access to real-time to globally order system events. This approach attempts to minimize coherency-traffic, while harnessing the benefit of shared read-mostly cache-lines. We show that the co-management of scalable memory reclamation, memory allocation, locking, and namespace management enables scalable system service implementation. We apply ParSec to both memcached, and virtual memory management in a microkernel, and find order-of magnitude performance increases on a four socket, 40 core machine, and 30x lower 99th percentile latencies for virtual memory management.
{"title":"Parallel sections: scaling system-level data-structures","authors":"Qi Wang, Tim Stamler, Gabriel Parmer","doi":"10.1145/2901318.2901356","DOIUrl":"https://doi.org/10.1145/2901318.2901356","url":null,"abstract":"As systems continue to increase the number of cores within cache coherency domains, traditional techniques for enabling parallel computation on data-structures are increasingly strained. A single contended cache-line bouncing between different caches can prohibit continued performance gains with additional cores. New abstractions and mechanisms are required to reassess how data-structure consistency can be provided, while maintaining stable per-core access latencies. This paper presents the Parallel Sections (ParSec) abstraction for mediating access to shared data-structures. Fundamental to the approach is a new form of scalable memory reclamation that leverages fast local access to real-time to globally order system events. This approach attempts to minimize coherency-traffic, while harnessing the benefit of shared read-mostly cache-lines. We show that the co-management of scalable memory reclamation, memory allocation, locking, and namespace management enables scalable system service implementation. We apply ParSec to both memcached, and virtual memory management in a microkernel, and find order-of magnitude performance increases on a four socket, 40 core machine, and 30x lower 99th percentile latencies for virtual memory management.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77742267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unconventional computing platforms have spread widely and rapidly following smart phones and tablets: consumer electronics such as smart TVs and digital cameras. For such devices, fast booting is a critical requirement; waiting tens of seconds for a TV or a camera to boot up is not acceptable, unlike a PC or smart phone. Moreover, the software platforms of these devices have become as rich as conventional computing devices to provide comparable services. As a result, the booting procedure to start every required OS service, hardware component, and application, the quantity of which is ever increasing, may take unbearable time for most consumers. To accelerate booting, this paper introduces Booting Booster (BB), which is used in all 2015 Samsung Smart TV models, and which runs the Linux-based Tizen OS. BB addresses the init scheme of Linux, which launches initial user-space OS services and applications and manages the life cycles of all user processes, by identifying and isolating booting-critical tasks, deferring non-critical tasks, and enabling execution of more tasks in parallel. BB has been successfully deployed in Samsung Smart TV 2015 models achieving a cold boot in 3.5 s (compared to 8.1 s with full commercial-grade optimizations without BB) without the need for suspend-to-RAM or hibernation. After this successful deployment, we have released the source code via http://opensource.samsung.com/, and BB will be included in the open-source OS, Tizen (http://tizen.org/).
{"title":"BB: booting booster for consumer electronics with modern OS","authors":"Geunsik Lim, MyungJoo Ham","doi":"10.1145/2901318.2901320","DOIUrl":"https://doi.org/10.1145/2901318.2901320","url":null,"abstract":"Unconventional computing platforms have spread widely and rapidly following smart phones and tablets: consumer electronics such as smart TVs and digital cameras. For such devices, fast booting is a critical requirement; waiting tens of seconds for a TV or a camera to boot up is not acceptable, unlike a PC or smart phone. Moreover, the software platforms of these devices have become as rich as conventional computing devices to provide comparable services. As a result, the booting procedure to start every required OS service, hardware component, and application, the quantity of which is ever increasing, may take unbearable time for most consumers. To accelerate booting, this paper introduces Booting Booster (BB), which is used in all 2015 Samsung Smart TV models, and which runs the Linux-based Tizen OS. BB addresses the init scheme of Linux, which launches initial user-space OS services and applications and manages the life cycles of all user processes, by identifying and isolating booting-critical tasks, deferring non-critical tasks, and enabling execution of more tasks in parallel. BB has been successfully deployed in Samsung Smart TV 2015 models achieving a cold boot in 3.5 s (compared to 8.1 s with full commercial-grade optimizations without BB) without the need for suspend-to-RAM or hibernation. After this successful deployment, we have released the source code via http://opensource.samsung.com/, and BB will be included in the open-source OS, Tizen (http://tizen.org/).","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84588977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}