Marisabel Guevara, Benjamin Lubin, Benjamin C. Lee
Specialization of datacenter resources brings performance and energy improvements in response to the growing scale and diversity of cloud applications. Yet heterogeneous hardware adds complexity and volatility to latency-sensitive applications. A resource allocation mechanism that leverages architectural principles can overcome both of these obstacles. We integrate research in heterogeneous architectures with recent advances in multi-agent systems. By embedding architectural insight into proxies that bid on behalf of applications, a market effectively allocates hardware to applications with diverse preferences and valuations. Exploring a space of heterogeneous datacenter configurations that mix server-class Xeon and mobile-class Atom processors, we find an optimal heterogeneous balance that improves both welfare and energy efficiency. We further design and evaluate twelve design points along the Xeon-to-Atom spectrum, and find that a mix of three processor architectures achieves a 12× reduction in response time violations relative to equal-power homogeneous systems.
{"title":"Market mechanisms for managing datacenters with heterogeneous microarchitectures","authors":"Marisabel Guevara, Benjamin Lubin, Benjamin C. Lee","doi":"10.1145/2541258","DOIUrl":"https://doi.org/10.1145/2541258","url":null,"abstract":"Specialization of datacenter resources brings performance and energy improvements in response to the growing scale and diversity of cloud applications. Yet heterogeneous hardware adds complexity and volatility to latency-sensitive applications. A resource allocation mechanism that leverages architectural principles can overcome both of these obstacles.\u0000 We integrate research in heterogeneous architectures with recent advances in multi-agent systems. Embedding architectural insight into proxies that bid on behalf of applications, a market effectively allocates hardware to applications with diverse preferences and valuations. Exploring a space of heterogeneous datacenter configurations, which mix server-class Xeon and mobile-class Atom processors, we find an optimal heterogeneous balance that improves both welfare and energy-efficiency. We further design and evaluate twelve design points along the Xeon-to-Atom spectrum, and find that a mix of three processor architectures achieves a 12× reduction in response time violations relative to equal-power homogeneous systems.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"17 1","pages":"3:1-3:31"},"PeriodicalIF":1.5,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90515886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale datacenters (DCs) host tens of thousands of diverse applications each day. However, interference between colocated workloads and the difficulty of matching applications to one of the many hardware platforms available can degrade performance, violating the quality of service (QoS) guarantees that many cloud workloads require. While previous work has identified the impact of heterogeneity and interference, existing solutions are computationally intensive, cannot be applied online, and do not scale beyond a few applications. We present Paragon, an online and scalable DC scheduler that is heterogeneity- and interference-aware. Paragon is derived from robust analytical methods; instead of profiling each application in detail, it leverages information the system already has about applications it has previously seen. It uses collaborative filtering techniques to quickly and accurately classify an unknown incoming workload with respect to heterogeneity and interference in multiple shared resources. It does so by identifying similarities to previously scheduled applications. The classification allows Paragon to greedily schedule applications in a manner that minimizes interference and maximizes server utilization. After the initial application placement, Paragon monitors application behavior and adjusts its scheduling decisions at runtime to avoid performance degradation. Additionally, we design ARQ, a multiclass admission control protocol that constrains application waiting time. ARQ queues applications in separate classes based on the type of resources they need and avoids long queueing delays for easy-to-satisfy workloads in highly loaded scenarios. Paragon scales to tens of thousands of servers and applications with marginal scheduling overheads in terms of time or state. We evaluate Paragon with a wide range of workload scenarios, on both small and large-scale systems, including 1,000 servers on EC2. For a 2,500-workload scenario, Paragon enforces performance guarantees for 91% of applications, while significantly improving utilization. In comparison, heterogeneity-oblivious, interference-oblivious, and least-loaded schedulers provide similar guarantees for only 14%, 11%, and 3% of workloads, respectively. The differences are more striking in oversubscribed scenarios where resource efficiency is more critical.
{"title":"QoS-Aware scheduling in heterogeneous datacenters with paragon","authors":"Christina Delimitrou, C. Kozyrakis","doi":"10.1145/2556583","DOIUrl":"https://doi.org/10.1145/2556583","url":null,"abstract":"Large-scale datacenters (DCs) host tens of thousands of diverse applications each day. However, interference between colocated workloads and the difficulty of matching applications to one of the many hardware platforms available can degrade performance, violating the quality of service (QoS) guarantees that many cloud workloads require. While previous work has identified the impact of heterogeneity and interference, existing solutions are computationally intensive, cannot be applied online, and do not scale beyond a few applications.\u0000 We present Paragon, an online and scalable DC scheduler that is heterogeneity- and interference-aware. Paragon is derived from robust analytical methods, and instead of profiling each application in detail, it leverages information the system already has about applications it has previously seen. It uses collaborative filtering techniques to quickly and accurately classify an unknown incoming workload with respect to heterogeneity and interference in multiple shared resources. It does so by identifying similarities to previously scheduled applications. The classification allows Paragon to greedily schedule applications in a manner that minimizes interference and maximizes server utilization. After the initial application placement, Paragon monitors application behavior and adjusts the scheduling decisions at runtime to avoid performance degradations. Additionally, we design ARQ, a multiclass admission control protocol that constrains application waiting time. ARQ queues applications in separate classes based on the type of resources they need and avoids long queueing delays for easy-to-satisfy workloads in highly-loaded scenarios. Paragon scales to tens of thousands of servers and applications with marginal scheduling overheads in terms of time or state.\u0000 We evaluate Paragon with a wide range of workload scenarios, on both small and large-scale systems, including 1,000 servers on EC2. For a 2,500-workload scenario, Paragon enforces performance guarantees for 91% of applications, while significantly improving utilization. In comparison, heterogeneity-oblivious, interference-oblivious, and least-loaded schedulers only provide similar guarantees for 14%, 11%, and 3% of workloads. The differences are more striking in oversubscribed scenarios where resource efficiency is more critical.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"21 1","pages":"12"},"PeriodicalIF":1.5,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85967024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Balakrishnan, D. Malkhi, John D. Davis, Vijayan Prabhakaran, M. Wei, Ted Wobber
CORFU is a global log which clients can append to and read from over a network. Internally, CORFU is distributed over a cluster of machines in such a way that there is no single I/O bottleneck to either appends or reads. Data is fully replicated for fault tolerance, and a modest cluster of about 16--32 machines with SSD drives can sustain 1 million 4-KByte operations per second. The CORFU log enabled the construction of a variety of distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services.
{"title":"CORFU: A distributed shared log","authors":"M. Balakrishnan, D. Malkhi, John D. Davis, Vijayan Prabhakaran, M. Wei, Ted Wobber","doi":"10.1145/2535930","DOIUrl":"https://doi.org/10.1145/2535930","url":null,"abstract":"CORFU is a global log which clients can append-to and read-from over a network. Internally, CORFU is distributed over a cluster of machines in such a way that there is no single I/O bottleneck to either appends or reads. Data is fully replicated for fault tolerance, and a modest cluster of about 16--32 machines with SSD drives can sustain 1 million 4-KByte operations per second.\u0000 The CORFU log enabled the construction of a variety of distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"42 1","pages":"10"},"PeriodicalIF":1.5,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81662992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stephen Smaldone, Benjamin Gilbert, J. Harkes, L. Iftode, M. Satyanarayanan
This article investigates the transient use of free local storage for improving performance in VM-based mobile computing systems implemented as thick clients on host PCs. We use the term TransientPC systems to refer to these types of systems. The solution we propose, called TransPart, uses the higher-performing local storage of host hardware to speed up performance-critical operations. Our solution constructs a virtual storage device on demand (which we call transient storage) by borrowing free disk blocks from the host’s storage. In this article, we present the design, implementation, and evaluation of a TransPart prototype, which requires no modifications to the software or hardware of a host computer. Experimental results confirm that TransPart offers low overhead and startup cost, while improving user experience.
{"title":"Optimizing Storage Performance for VM-Based Mobile Computing","authors":"Stephen Smaldone, Benjamin Gilbert, J. Harkes, L. Iftode, M. Satyanarayanan","doi":"10.1145/2465346.2465348","DOIUrl":"https://doi.org/10.1145/2465346.2465348","url":null,"abstract":"This article investigates the transient use of free local storage for improving performance in VM-based mobile computing systems implemented as thick clients on host PCs. We use the term TransientPC systems to refer to these types of systems. The solution we propose, called TransPart, uses the higher-performing local storage of host hardware to speed up performance-critical operations. Our solution constructs a virtual storage device on demand (which we call transient storage) by borrowing free disk blocks from the host’s storage. In this article, we present the design, implementation, and evaluation of a TransPart prototype, which requires no modifications to the software or hardware of a host computer. Experimental results confirm that TransPart offers low overhead and startup cost, while improving user experience.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"11 1","pages":"5"},"PeriodicalIF":1.5,"publicationDate":"2013-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83839537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Content-based publish/subscribe (CPS) is an appealing abstraction for building scalable distributed systems, e.g., message boards, intrusion detectors, or algorithmic stock trading platforms. Recently, CPS extensions have been proposed for location-based services like vehicular networks, mobile social networking, and so on. Although current CPS middleware systems are dynamic in the way they support the joining and leaving of publishers and subscribers, they fall short in supporting subscription adaptations. These are becoming increasingly important across many CPS applications. In algorithmic high-frequency trading, for instance, stock price thresholds that are of interest to a trader change rapidly, and gains directly hinge on the reaction time to relevant fluctuations rather than fixed values. In location-aware applications, a subscription is a function of the subscriber's location (e.g., GPS coordinates), which inherently changes during motion. The common solution for adapting a subscription consists of a resubscription, where a new subscription is issued and the superseded one canceled. This incurs substantial overhead in CPS middleware systems and leads to missed or duplicated events during the transition. In this article, we explore the concept of parametric subscriptions for capturing subscription adaptations. We discuss desirable and feasible guarantees for corresponding support, and propose novel algorithms for updating routing mechanisms effectively and efficiently in classic decentralized CPS broker overlay networks. Compared to resubscriptions, our algorithms significantly improve the reaction time to subscription updates without hampering throughput or latency under high update rates. We also propose and evaluate approximation techniques to detect and mitigate pathological cases of high-frequency subscription oscillations, which could significantly decrease the throughput of CPS systems, thereby affecting other subscribers. We analyze the benefits of our support through implementations of our algorithms in two CPS systems, and by evaluating our algorithms on two different application scenarios.
{"title":"Parametric Content-Based Publish/Subscribe","authors":"K. R. Jayaram, P. Eugster, C. Jayalath","doi":"10.1145/2465346.2465347","DOIUrl":"https://doi.org/10.1145/2465346.2465347","url":null,"abstract":"Content-based publish/subscribe (CPS) is an appealing abstraction for building scalable distributed systems, e.g., message boards, intrusion detectors, or algorithmic stock trading platforms. Recently, CPS extensions have been proposed for location-based services like vehicular networks, mobile social networking, and so on.\u0000 Although current CPS middleware systems are dynamic in the way they support the joining and leaving of publishers and subscribers, they fall short in supporting subscription adaptations. These are becoming increasingly important across many CPS applications. In algorithmic high frequency trading, for instance, stock price thresholds that are of interest to a trader change rapidly, and gains directly hinge on the reaction time to relevant fluctuations rather than fixed values. In location-aware applications, a subscription is a function of the subscriber location (e.g. GPS coordinates), which inherently changes during motion.\u0000 The common solution for adapting a subscription consists of a resubscription, where a new subscription is issued and the superseded one canceled. This incurs substantial overhead in CPS middleware systems, and leads to missed or duplicated events during the transition. In this article, we explore the concept of parametric subscriptions for capturing subscription adaptations. We discuss desirable and feasible guarantees for corresponding support, and propose novel algorithms for updating routing mechanisms effectively and efficiently in classic decentralized CPS broker overlay networks. Compared to resubscriptions, our algorithms significantly improve the reaction time to subscription updates without hampering throughput or latency under high update rates. We also propose and evaluate approximation techniques to detect and mitigate pathological cases of high frequency subscription oscillations, which could significantly decrease the throughput of CPS systems thereby affecting other subscribers.\u0000 We analyze the benefits of our support through implementations of our algorithms in two CPS systems, and by evaluating our algorithms on two different application scenarios.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"18 1","pages":"4"},"PeriodicalIF":1.5,"publicationDate":"2013-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84012856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Rasmussen, G. Porter, Michael Conley, H. Madhyastha, Radhika Niranjan Mysore, A. Pucher, Amin Vahdat
We present TritonSort, a highly efficient, scalable sorting system. It is designed to process large datasets, and has been evaluated against as much as 100TB of input data spread across 832 disks in 52 nodes at a rate of 0.938TB/min. When evaluated against the annual Indy GraySort sorting benchmark, TritonSort is 66% better in absolute performance and has over six times the per-node throughput of the previous record holder. When evaluated against the 100TB Indy JouleSort benchmark, TritonSort sorted 9703 records/Joule. In this article, we describe the hardware and software architecture necessary to operate TritonSort at this level of efficiency. Through careful management of system resources to ensure cross-resource balance, we are able to sort data at approximately 80% of the disks’ aggregate sequential write speed. We believe the work holds a number of lessons for balanced system design and for scale-out architectures in general. While many interesting systems are able to scale linearly with additional servers, per-server performance can lag behind per-server capacity by more than an order of magnitude. Bridging the gap between high scalability and high performance would enable either significantly less expensive systems that are able to do the same work or provide the ability to address significantly larger problem sets with the same infrastructure.
{"title":"TritonSort: A Balanced and Energy-Efficient Large-Scale Sorting System","authors":"A. Rasmussen, G. Porter, Michael Conley, H. Madhyastha, Radhika Niranjan Mysore, A. Pucher, Amin Vahdat","doi":"10.1145/2427631.2427634","DOIUrl":"https://doi.org/10.1145/2427631.2427634","url":null,"abstract":"We present TritonSort, a highly efficient, scalable sorting system. It is designed to process large datasets, and has been evaluated against as much as 100TB of input data spread across 832 disks in 52 nodes at a rate of 0.938TB/min. When evaluated against the annual Indy GraySort sorting benchmark, TritonSort is 66% better in absolute performance and has over six times the per-node throughput of the previous record holder. When evaluated against the 100TB Indy JouleSort benchmark, TritonSort sorted 9703 records/Joule. In this article, we describe the hardware and software architecture necessary to operate TritonSort at this level of efficiency. Through careful management of system resources to ensure cross-resource balance, we are able to sort data at approximately 80% of the disks’ aggregate sequential write speed.\u0000 We believe the work holds a number of lessons for balanced system design and for scale-out architectures in general. While many interesting systems are able to scale linearly with additional servers, per-server performance can lag behind per-server capacity by more than an order of magnitude. Bridging the gap between high scalability and high performance would enable either significantly less expensive systems that are able to do the same work or provide the ability to address significantly larger problem sets with the same infrastructure.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"10 1","pages":"3"},"PeriodicalIF":1.5,"publicationDate":"2013-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2427631.2427634","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72527510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sriram Govindan, Di Wang, A. Sivasubramaniam, B. Urgaonkar
Datacenters spend $10--25 per watt in provisioning their power infrastructure, regardless of the watts actually consumed. Since peak power needs arise rarely, provisioning power infrastructure for them can be expensive. One can, thus, aggressively underprovision infrastructure, assuming that simultaneous peak draw across all equipment will happen rarely. The resulting nonzero probability of emergency events where power needs exceed provisioned capacity, however small, mandates graceful reaction mechanisms to cap the power draw instead of leaving it to disruptive circuit breakers/fuses. Existing strategies for power capping use temporal knobs local to a server that throttle the rate of execution (using power modes), and/or spatial knobs that redirect/migrate excess load to regions of the datacenter with more power headroom. We show these mechanisms to have performance-degrading ramifications, and propose an entirely orthogonal solution that leverages existing UPS batteries to temporarily augment the utility supply during emergencies. We build an experimental prototype to demonstrate such power capping on a cluster of 8 servers, each with an individual battery, and implement several online heuristics in the context of different datacenter workloads to evaluate their effectiveness in handling power emergencies. We show that our battery-based solution can: (i) handle emergencies of short durations on its own, (ii) supplement existing reaction mechanisms to enhance their efficacy for longer emergencies, and (iii) create more slack for shifting applications temporarily to nonpeak durations.
{"title":"Aggressive Datacenter Power Provisioning with Batteries","authors":"Sriram Govindan, Di Wang, A. Sivasubramaniam, B. Urgaonkar","doi":"10.1145/2427631.2427633","DOIUrl":"https://doi.org/10.1145/2427631.2427633","url":null,"abstract":"Datacenters spend $10--25 per watt in provisioning their power infrastructure, regardless of the watts actually consumed. Since peak power needs arise rarely, provisioning power infrastructure for them can be expensive. One can, thus, aggressively underprovision infrastructure assuming that simultaneous peak draw across all equipment will happen rarely. The resulting nonzero probability of emergency events where power needs exceed provisioned capacity, however small, mandates graceful reaction mechanisms to cap the power draw instead of leaving it to disruptive circuit breakers/fuses. Existing strategies for power capping use temporal knobs local to a server that throttle the rate of execution (using power modes), and/or spatial knobs that redirect/migrate excess load to regions of the datacenter with more power headroom. We show these mechanisms to have performance degrading ramifications, and propose an entirely orthogonal solution that leverages existing UPS batteries to temporarily augment the utility supply during emergencies. We build an experimental prototype to demonstrate such power capping on a cluster of 8 servers, each with an individual battery, and implement several online heuristics in the context of different datacenter workloads to evaluate their effectiveness in handling power emergencies. We show that our battery-based solution can: (i) handle emergencies of short durations on its own, (ii) supplement existing reaction mechanisms to enhance their efficacy for longer emergencies, and (iii) create more slack for shifting applications temporarily to nonpeak durations.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"14 1","pages":"2"},"PeriodicalIF":1.5,"publicationDate":"2013-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86460420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Edouard Bugnion, Scott Devine, M. Rosenblum, J. Sugerman, E. Wang
This article describes the historical context, technical challenges, and main implementation techniques used by VMware Workstation to bring virtualization to the x86 architecture in 1999. Although virtual machine monitors (VMMs) had been around for decades, they were traditionally designed as part of monolithic, single-vendor architectures with explicit support for virtualization. In contrast, the x86 architecture lacked virtualization support, and the industry around it had disaggregated into an ecosystem, with different vendors controlling the computers, CPUs, peripherals, operating systems, and applications, none of them asking for virtualization. We chose to build our solution independently of these vendors. As a result, VMware Workstation had to deal with new challenges associated with (i) the lack of virtualization support in the x86 architecture, (ii) the daunting complexity of the architecture itself, (iii) the need to support a broad combination of peripherals, and (iv) the need to offer a simple user experience within existing environments. These new challenges led us to a novel combination of well-known virtualization techniques, techniques from other domains, and new techniques. VMware Workstation combined a hosted architecture with a VMM. The hosted architecture enabled a simple user experience and offered broad hardware compatibility. Rather than exposing I/O diversity to the virtual machines, VMware Workstation also relied on software emulation of I/O devices. The VMM combined a trap-and-emulate direct execution engine with a system-level dynamic binary translator to efficiently virtualize the x86 architecture and support most commodity operating systems. By relying on x86 hardware segmentation as a protection mechanism, the binary translator could execute translated code at near hardware speeds. The binary translator also relied on partial evaluation and adaptive retranslation to reduce the overall overheads of virtualization. Written with the benefit of hindsight, this article shares the key lessons we learned from building the original system and from its later evolution.
{"title":"Bringing Virtualization to the x86 Architecture with the Original VMware Workstation","authors":"Edouard Bugnion, Scott Devine, M. Rosenblum, J. Sugerman, E. Wang","doi":"10.1145/2382553.2382554","DOIUrl":"https://doi.org/10.1145/2382553.2382554","url":null,"abstract":"This article describes the historical context, technical challenges, and main implementation techniques used by VMware Workstation to bring virtualization to the x86 architecture in 1999. Although virtual machine monitors (VMMs) had been around for decades, they were traditionally designed as part of monolithic, single-vendor architectures with explicit support for virtualization. In contrast, the x86 architecture lacked virtualization support, and the industry around it had disaggregated into an ecosystem, with different vendors controlling the computers, CPUs, peripherals, operating systems, and applications, none of them asking for virtualization. We chose to build our solution independently of these vendors.\u0000 As a result, VMware Workstation had to deal with new challenges associated with (i) the lack of virtualization support in the x86 architecture, (ii) the daunting complexity of the architecture itself, (iii) the need to support a broad combination of peripherals, and (iv) the need to offer a simple user experience within existing environments. These new challenges led us to a novel combination of well-known virtualization techniques, techniques from other domains, and new techniques.\u0000 VMware Workstation combined a hosted architecture with a VMM. The hosted architecture enabled a simple user experience and offered broad hardware compatibility. Rather than exposing I/O diversity to the virtual machines, VMware Workstation also relied on software emulation of I/O devices. The VMM combined a trap-and-emulate direct execution engine with a system-level dynamic binary translator to efficiently virtualize the x86 architecture and support most commodity operating systems. By relying on x86 hardware segmentation as a protection mechanism, the binary translator could execute translated code at near hardware speeds. The binary translator also relied on partial evaluation and adaptive retranslation to reduce the overall overheads of virtualization.\u0000 Written with the benefit of hindsight, this article shares the key lessons we learned from building the original system and from its later evolution.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"34 1","pages":"12:1-12:51"},"PeriodicalIF":1.5,"publicationDate":"2012-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90911112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ú. Erlingsson, Marcus Peinado, Simon Peter, M. Budiu, Gloria Mainar-Ruiz
Fay is a flexible platform for the efficient collection, processing, and analysis of software execution traces. Fay provides dynamic tracing through use of runtime instrumentation and distributed aggregation within machines and across clusters. At the lowest level, Fay can be safely extended with new tracing primitives, including even untrusted, fully optimized machine code, and Fay can be applied to running user-mode or kernel-mode software without compromising system stability. At the highest level, Fay provides a unified, declarative means of specifying what events to trace, as well as the aggregation, processing, and analysis of those events. We have implemented the Fay tracing platform for Windows and integrated it with two powerful, expressive systems for distributed programming. Our implementation is easy to use, can be applied to unmodified production systems, and provides primitives that allow the overhead of tracing to be greatly reduced, compared to previous dynamic tracing platforms. To show the generality of Fay tracing, we reimplement, in experiments, a range of tracing strategies and several custom mechanisms from existing tracing frameworks. Fay shows that modern techniques for high-level querying and data-parallel processing of disaggregated data streams are well suited to comprehensive monitoring of software execution in distributed systems. Revisiting a lesson from the late 1960s [Deutsch and Grant 1971], Fay also demonstrates the efficiency and extensibility benefits of using safe, statically verified machine code as the basis for low-level execution tracing. Finally, Fay establishes that, by automatically deriving optimized query plans and code for safe extensions, the expressiveness and performance of high-level tracing queries can equal or even surpass that of specialized monitoring tools.
{"title":"Fay: Extensible Distributed Tracing from Kernels to Clusters","authors":"Ú. Erlingsson, Marcus Peinado, Simon Peter, M. Budiu, Gloria Mainar-Ruiz","doi":"10.1145/2382553.2382555","DOIUrl":"https://doi.org/10.1145/2382553.2382555","url":null,"abstract":"Fay is a flexible platform for the efficient collection, processing, and analysis of software execution traces. Fay provides dynamic tracing through use of runtime instrumentation and distributed aggregation within machines and across clusters. At the lowest level, Fay can be safely extended with new tracing primitives, including even untrusted, fully optimized machine code, and Fay can be applied to running user-mode or kernel-mode software without compromising system stability. At the highest level, Fay provides a unified, declarative means of specifying what events to trace, as well as the aggregation, processing, and analysis of those events.\u0000 We have implemented the Fay tracing platform for Windows and integrated it with two powerful, expressive systems for distributed programming. Our implementation is easy to use, can be applied to unmodified production systems, and provides primitives that allow the overhead of tracing to be greatly reduced, compared to previous dynamic tracing platforms. To show the generality of Fay tracing, we reimplement, in experiments, a range of tracing strategies and several custom mechanisms from existing tracing frameworks.\u0000 Fay shows that modern techniques for high-level querying and data-parallel processing of disagreggated data streams are well suited to comprehensive monitoring of software execution in distributed systems. Revisiting a lesson from the late 1960s [Deutsch and Grant 1971], Fay also demonstrates the efficiency and extensibility benefits of using safe, statically verified machine code as the basis for low-level execution tracing. Finally, Fay establishes that, by automatically deriving optimized query plans and code for safe extensions, the expressiveness and performance of high-level tracing queries can equal or even surpass that of specialized monitoring tools.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"60 1","pages":"13:1-13:35"},"PeriodicalIF":1.5,"publicationDate":"2012-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90705185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anshul Gandhi, Mor Harchol-Balter, R. Raghunathan, M. Kozuch
Energy costs for data centers continue to rise, already exceeding $15 billion yearly. Sadly, much of this power is wasted. Servers are busy only 10--30% of the time on average, but they are often left on while idle, drawing 60% or more of peak power. We introduce a dynamic capacity management policy, AutoScale, that greatly reduces the number of servers needed in data centers driven by unpredictable, time-varying load, while meeting response time SLAs. AutoScale scales the data center capacity, adding or removing servers as needed. AutoScale has two key features: (i) it autonomically maintains just the right amount of spare capacity to handle bursts in the request rate; and (ii) it is robust not just to changes in the request rate of real-world traces, but also to changes in request size and server efficiency. We evaluate our dynamic capacity management approach via implementation on a 38-server multi-tier data center, serving a web site of the type seen in Facebook or Amazon, with a key-value store workload. We demonstrate that AutoScale vastly improves upon existing dynamic capacity management policies with respect to meeting SLAs and robustness.
{"title":"AutoScale: Dynamic, Robust Capacity Management for Multi-Tier Data Centers","authors":"Anshul Gandhi, Mor Harchol-Balter, R. Raghunathan, M. Kozuch","doi":"10.1145/2382553.2382556","DOIUrl":"https://doi.org/10.1145/2382553.2382556","url":null,"abstract":"Energy costs for data centers continue to rise, already exceeding $15 billion yearly. Sadly much of this power is wasted. Servers are only busy 10--30% of the time on average, but they are often left on, while idle, utilizing 60% or more of peak power when in the idle state.\u0000 We introduce a dynamic capacity management policy, AutoScale, that greatly reduces the number of servers needed in data centers driven by unpredictable, time-varying load, while meeting response time SLAs. AutoScale scales the data center capacity, adding or removing servers as needed. AutoScale has two key features: (i) it autonomically maintains just the right amount of spare capacity to handle bursts in the request rate; and (ii) it is robust not just to changes in the request rate of real-world traces, but also request size and server efficiency.\u0000 We evaluate our dynamic capacity management approach via implementation on a 38-server multi-tier data center, serving a web site of the type seen in Facebook or Amazon, with a key-value store workload. We demonstrate that AutoScale vastly improves upon existing dynamic capacity management policies with respect to meeting SLAs and robustness.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"4 1","pages":"14:1-14:26"},"PeriodicalIF":1.5,"publicationDate":"2012-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73616765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}