Proceedings of the Eleventh European Conference on Computer Systems: latest articles

RapiLog: reducing system complexity through verification

Proceedings of the Eleventh European Conference on Computer Systems

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465383

G. Heiser, Etienne Le Sueur, A. Danis, Aleksander Budzynowski, T. Salomie, G. Alonso

Database management systems provide updates with guaranteed durability in the presence of OS crashes or power failures. Durability is achieved by performing synchronous writes to a transaction log on stable, non-volatile storage. The procedure is expensive and several techniques have been devised to ameliorate the impact on overall performance at the cost of increased system complexity. In this paper we explore the possibility of reducing the system complexity around logging by leveraging verification instead of using specialised/dedicated hardware or complicated optimisations. The prototype system, RapiLog, uses a dependable hypervisor based on seL4 to buffer log data outside the database system and its OS, and performs the physical disk writes asynchronously with respect to the operation of the database. RapiLog guarantees that the log data will eventually be written to the disk even if the database system or the underlying OS crashes or electrical power is cut. We evaluate RapiLog with multiple open-source and commercial database engines and find that performance is never degraded (beyond the virtualisation overhead), and at times is significantly improved.
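
The decoupling described in this abstract (acknowledge a log record once it is buffered, write it to disk later) can be illustrated with a minimal sketch. The class name `LogBuffer`, its methods, and the `write_fn` callback are hypothetical names chosen for illustration. The sketch does not reproduce RapiLog's durability guarantee, which the abstract attributes to a dependable seL4-based hypervisor holding the buffer outside the database and its OS.

```python
from collections import deque


class LogBuffer:
    """Toy log buffer: appends are acknowledged before they reach storage."""

    def __init__(self, write_fn):
        self._write = write_fn      # hypothetical callback that performs the physical write
        self._pending = deque()     # records accepted but not yet written
        self.acked = 0              # records acknowledged to the database
        self.written = 0            # records handed to the physical writer

    def append(self, record):
        """Accept a record and acknowledge it immediately."""
        self._pending.append(record)
        self.acked += 1
        return self.acked

    def drain(self, max_records=None):
        """Write pending records in order; runs independently of append()."""
        count = 0
        while self._pending and (max_records is None or count < max_records):
            self._write(self._pending.popleft())
            count += 1
        self.written += count
        return count
```

The gap between `acked` and `written` is the data that must survive a crash. In this sketch it would be lost, which is why the buffering component has to be trustworthy.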

Citations: 19

Omega: flexible, scalable schedulers for large compute clusters

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465386

Malte Schwarzkopf, A. Konwinski, M. Abd-El-Malek, J. Wilkes

Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control. We compare this approach to existing cluster scheduler designs, evaluate how much interference between schedulers occurs and how much it matters in practice, present some techniques to alleviate it, and finally discuss a use case highlighting the advantages of our approach, all driven by real-life Google production workloads.
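
As a rough illustration of optimistic concurrency control over shared cluster state, the sketch below lets each scheduler work on a private snapshot and commit only if the chosen machine is unchanged since that snapshot. The names `CellState` and `schedule`, the per-machine version counter, and the "most free CPUs" placement rule are assumptions made for this example, not details taken from the abstract. The sketch is single-threaded; a real implementation would need the commit step to be atomic.

```python
from dataclasses import dataclass, field


@dataclass
class CellState:
    """Shared cluster state: free CPUs and a version counter per machine."""
    free_cpus: dict
    versions: dict = field(default_factory=dict)

    def snapshot(self):
        # Each scheduler plans against a private copy of the shared state.
        free = dict(self.free_cpus)
        seen = {m: self.versions.get(m, 0) for m in self.free_cpus}
        return free, seen

    def commit(self, machine, cpus, seen_version):
        """Claim cpus on machine only if it is unchanged since the snapshot."""
        if self.versions.get(machine, 0) != seen_version:
            return False  # conflict: another scheduler committed first
        if self.free_cpus[machine] < cpus:
            return False
        self.free_cpus[machine] -= cpus
        self.versions[machine] = seen_version + 1
        return True


def schedule(cell, cpus, max_retries=3):
    """Place one task: snapshot, choose a machine, commit, retry on conflict."""
    for _ in range(max_retries):
        free, seen = cell.snapshot()
        candidates = [m for m, c in free.items() if c >= cpus]
        if not candidates:
            return None
        machine = max(candidates, key=lambda m: (free[m], m))
        if cell.commit(machine, cpus, seen[machine]):
            return machine
    return None
```

A scheduler whose commit is rejected simply takes a fresh snapshot and tries again, so no scheduler ever blocks waiting for another.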

Citations: 703

TimeStream: reliable stream computation in the cloud

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465353

Zhengping Qian, Yong He, Chunzhi Su, Zhuojie Wu, Hongyu Zhu, Taizhi Zhang, Lidong Zhou, Yuan Yu, Zheng Zhang

TimeStream is a distributed system designed specifically for low-latency continuous processing of big streaming data on a large cluster of commodity machines. The unique characteristics of this emerging application domain have led to a significantly different design from the popular MapReduce-style batch data processing. In particular, we advocate a powerful new abstraction called resilient substitution that caters to the specific needs in this new computation model to handle failure recovery and dynamic reconfiguration in response to load changes. Several real-world applications running on our prototype have been shown to scale robustly with low latency while at the same time maintaining the simple and concise declarative programming model. TimeStream handles an on-line advertising aggregation pipeline at a rate of 700,000 URLs per second with a 2-second delay, while performing sentiment analysis of Twitter data at a peak rate close to 10,000 tweets per second, with approximately 2-second delay.

Citations: 252

Hypnos: understanding and treating sleep conflicts in smartphones

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465377

Abhilash Jindal, Abhinav Pathak, Y. C. Hu, S. Midkiff

To maximally conserve the critical resource of battery energy, smartphone OSes implement an aggressive system suspend policy that suspends the whole system after a brief period of user inactivity. This burdens developers with the responsibility of keeping the system on, or waking it up, to execute time-sensitive code. Developer mistakes in using the explicit power management unavoidably give rise to energy bugs, which cause significant, unexpected battery drain. In this paper, we study a new class of energy bugs, called sleep conflicts, which can happen in smartphone device drivers. Sleep conflict happens when a component in a high power state is unable to transition back to the base power state because the system is suspended when the device driver code responsible for driving the transition is supposed to execute. We illustrate the root cause of sleep conflicts, develop a classification of the four types of sleep conflicts, and finally present a runtime system that performs sleep conflict avoidance, along with a simple yet effective pre-deployment testing scheme. We have implemented and evaluated our system on two Android smartphones. Our testing scheme detects several sleep conflicts in WiFi and vibrator drivers, and our runtime avoidance scheme effectively prevents sleep conflicts from draining the battery.

Citations: 39

MeT: workload aware elasticity for NoSQL

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465370

F. Cruz, Francisco Maia, M. Matos, R. Oliveira, J. Paulo, J. Pereira, R. Vilaça

NoSQL databases manage the bulk of data produced by modern Web applications such as social networks. This stems from their ability to partition and spread data to all available nodes, allowing NoSQL systems to scale. Unfortunately, current solutions' scale-out is oblivious to the underlying data access patterns, resulting in both highly skewed load across nodes and suboptimal node configurations. In this paper, we first show that judicious placement of HBase partitions taking into account data access patterns can improve overall throughput by 35%. Next, we go beyond current state-of-the-art elastic systems limited to uninformed replica addition and removal by: i) reconfiguring existing replicas according to access patterns and ii) adding replicas specifically configured to the expected access pattern. MeT is a prototype for a Cloud-enabled framework that can be used alone or in conjunction with OpenStack for the automatic and heterogeneous reconfiguration of an HBase deployment. Our evaluation, conducted using the YCSB workload generator and a TPC-C workload, shows that MeT is able to i) autonomously achieve the performance of a manually configured cluster and ii) quickly reconfigure the cluster according to unpredicted workload changes.

Citations: 73

Composing OS extensions safely and efficiently with Bascule

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465375

Andrew Baumann, Dongyoon Lee, Pedro Fonseca, Lisa Glendenning, Jacob R. Lorch, Barry Bond, Reuben Olinsky, G. Hunt

Library OS (LibOS) architectures implement the OS personality as a user-mode library, giving each application the flexibility to choose its LibOS. This approach is appealing for many reasons, not least the ability to extend or customise the LibOS. Recent work with Drawbridge [29] showed that an existing commodity OS (Windows 7) could be refactored to produce a LibOS while retaining application compatibility. This paper presents Bascule, an architecture for LibOS extensions based on Drawbridge. Rather than relying on the application developer to customise a LibOS, Bascule allows OS-independent extensions to be attached at runtime. Extensions interpose on a narrow binary interface of primitive OS abstractions, such as files and virtual memory. Thus, they are independent of both guest and host OS, and composable at runtime. Since an extension runs in the same process as an application and its LibOS, it is safe and efficient. Bascule demonstrates extension reuse across diverse guest LibOSes (Windows and Linux) and host OSes (Windows and Barrelfish). Current extensions include file system translation, checkpointing, and architecture adaptation.

Citations: 42

Whose cache line is it anyway?: operating system support for live detection and repair of false sharing

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465366

Mihir Nanavati, Mark Spear, Nathan Taylor, Shriram Rajagopalan, Dutch T. Meyer, W. Aiello, A. Warfield

As hardware parallelism continues to increase, CPU caches can no longer be considered as a transparent, hardware-level performance optimization. Cache impact on performance, in particular in the face of false sharing, is completely dependent on the software that is executing. To effectively support parallel workloads on cache coherent hardware, the operating system must begin to treat the CPU cache like other shared hardware resources, and manage it appropriately. We demonstrate a prototype example of such support by describing Plastic, a software-based system that detects, diagnoses, and transparently repairs false sharing as it occurs in running applications. Plastic solves two challenging problems. First, it is capable of rapid, low-overhead detection and diagnosis of false sharing in unmodified, running applications. Second, it resolves identified instances of false sharing by providing a sub-page granularity memory remapping facility within the system. Our implementation is capable of identifying and repairing pathological false sharing in under one second of execution and achieves speedups of 3-6x on known examples of false sharing in parallel benchmarks.
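
To make the notion of false sharing concrete, the sketch below flags cache lines that several threads write to at different addresses, so they contend for the line without sharing any data. It works on a recorded list of writes. The function name `find_false_sharing`, the 64-byte line size, and the trace format are assumptions for illustration; the abstract does not describe how Plastic performs its detection, and this sketch ignores access widths.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 64  # assumed line size, not stated in the abstract


def find_false_sharing(writes, line_bytes=CACHE_LINE_BYTES):
    """Return base addresses of lines written by several threads at disjoint addresses.

    writes: iterable of (thread_id, byte_address) pairs.
    """
    per_line = defaultdict(lambda: defaultdict(set))
    for thread_id, addr in writes:
        line_base = addr - (addr % line_bytes)
        per_line[line_base][thread_id].add(addr)

    flagged = []
    for line_base, by_thread in per_line.items():
        if len(by_thread) < 2:
            continue  # only one thread touches this line
        all_addrs = [a for addrs in by_thread.values() for a in addrs]
        # No address is written by two threads: the line is shared, the data is not.
        if len(all_addrs) == len(set(all_addrs)):
            flagged.append(line_base)
    return sorted(flagged)
```

Moving one of the conflicting variables onto a different line, for example by padding, removes the contention. The abstract's sub-page remapping facility serves the same purpose without modifying the application.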

Citations: 31

Augustus: scalable and robust storage for cloud applications

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465362

Ricardo Padilha, F. Pedone

Cloud-scale storage applications have strict requirements. On the one hand, they require scalable throughput; on the other hand, many applications would largely benefit from strong consistency. Since these requirements are sometimes considered contradictory, the subject has split the community with one side defending scalability at any cost (the "NoSQL" side), and the other side holding on to time-proven transactional storage systems (the "SQL" side). In this paper, we present Augustus, a system that aims to bridge the sides by offering low-cost transactions with strong consistency and scalable throughput. Furthermore, Augustus assumes Byzantine failures to ensure data consistency even in the most hostile environments. We evaluated Augustus with a suite of micro-benchmarks, Buzzer (a Twitter-like service), and BFT Derby (an SQL engine based on Apache Derby).

Citations: 20

RadixVM: scalable address spaces for multithreaded applications

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465373

A. Clements, M. Kaashoek, N. Zeldovich

RadixVM is a new virtual memory system design that enables fully concurrent operations on shared address spaces for multithreaded processes on cache-coherent multicore computers. Today, most operating systems serialize operations such as mmap and munmap, which forces application developers to split their multithreaded applications into multiprocess applications, hoard memory to avoid the overhead of returning it, and so on. RadixVM removes this burden from application developers by ensuring that address space operations on non-overlapping memory regions scale perfectly. It does so by combining three techniques: 1) it organizes metadata in a radix tree instead of a balanced tree to avoid unnecessary cache line movement; 2) it uses a novel memory-efficient distributed reference counting scheme; and 3) it uses a new scheme to target remote TLB shootdowns and to often avoid them altogether. Experiments on an 80 core machine show that RadixVM achieves perfect scalability for non-overlapping regions: if several threads mmap or munmap pages in parallel, they can run completely independently and induce no cache coherence traffic.
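
The first of the three techniques, keeping mapping metadata in a radix tree, can be sketched as follows. In a fixed-depth radix tree the path to a page depends only on the page number, so operations on ranges that fall under different leaf nodes write to different nodes and nothing is rebalanced. The class name `RadixTree`, the 9-bit fan-out, and the four levels are assumptions for illustration. The sketch models only the data structure, not concurrency, reference counting, or TLB shootdowns.

```python
BITS_PER_LEVEL = 9  # assumed fan-out of 512 slots per node
LEVELS = 4          # assumed depth


class RadixTree:
    """Fixed-depth radix tree mapping virtual page numbers to metadata."""

    def __init__(self):
        self.root = {}

    def _indices(self, vpn):
        # Split the page number into one slot index per level, top level first.
        mask = (1 << BITS_PER_LEVEL) - 1
        return [(vpn >> (BITS_PER_LEVEL * lvl)) & mask
                for lvl in reversed(range(LEVELS))]

    def _leaf(self, vpn, create):
        node = self.root
        for idx in self._indices(vpn)[:-1]:
            if idx not in node:
                if not create:
                    return None
                node[idx] = {}
            node = node[idx]
        return node

    def map_range(self, start_vpn, npages, meta):
        """Record meta for each page; return the ids of the leaf nodes written."""
        touched = set()
        for vpn in range(start_vpn, start_vpn + npages):
            leaf = self._leaf(vpn, create=True)
            leaf[self._indices(vpn)[-1]] = meta
            touched.add(id(leaf))
        return touched

    def lookup(self, vpn):
        leaf = self._leaf(vpn, create=False)
        return None if leaf is None else leaf.get(self._indices(vpn)[-1])

    def unmap_range(self, start_vpn, npages):
        for vpn in range(start_vpn, start_vpn + npages):
            leaf = self._leaf(vpn, create=False)
            if leaf is not None:
                leaf.pop(self._indices(vpn)[-1], None)
```

In this sketch two ranges that sit in the same 512-page block still share a leaf node; only ranges under different leaves write to disjoint nodes.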

Citations: 93

Failure-atomic msync(): a simple and efficient mechanism for preserving the integrity of durable data

Pub Date : 2013-04-15 DOI: 10.1145/2465351.2465374

Stan Park, T. Kelly, Kai Shen

Preserving the integrity of application data across updates is difficult if power outages and system crashes may occur during updates. Existing approaches such as relational databases and transactional key-value stores restrict programming flexibility by mandating narrow data access interfaces. We have designed, implemented, and evaluated an approach that strengthens the semantics of a standard operating system primitive while maintaining conceptual simplicity and supporting highly flexible programming: Failure-atomic msync() commits changes to a memory-mapped file atomically, even in the presence of failures. Our Linux implementation of failure-atomic msync() has preserved application data integrity across hundreds of whole-machine power interruptions and exhibits good microbenchmark performance on both spinning disks and solid-state storage. Failure-atomic msync() supports higher layers of fully general programming abstraction, e.g., a persistent heap that easily slips beneath the C++ Standard Template Library. An STL <map> built atop failure-atomic msync() outperforms several local key-value stores that support transactional updates. We integrated failure-atomic msync() into the Kyoto Tycoon key-value server by modifying exactly one line of code; our modified server reduces response times by 26-43% compared to Tycoon's existing transaction support while providing the same data integrity guarantees. Compared to a Tycoon server setup that makes almost no I/O (and therefore provides no support for data durability and integrity over failures), failure-atomic msync() incurs a three-fold response time increase on a fast Flash-based SSD, an acceptable cost of data reliability for many.
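
For readers unfamiliar with the programming model, the sketch below shows the ordinary pattern that the paper's primitive strengthens: modify a memory-mapped file in place, then ask the OS to write the changes back. It uses Python's standard `mmap` module, whose `flush()` method corresponds to `msync()` on Unix. This is conventional `msync()`, not the failure-atomic variant, so it gives no atomicity guarantee across a crash. The function names and the 8-byte counter layout are hypothetical.

```python
import mmap
import os
import struct
import tempfile


def update_counter(path, delta):
    """Add delta to a 64-bit counter stored in a memory-mapped file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 8) as mem:
            (value,) = struct.unpack_from("<q", mem, 0)
            struct.pack_into("<q", mem, 0, value + delta)
            # flush() requests write-back of the modified pages. With ordinary
            # msync(), a crash partway through a larger update can leave a mix
            # of old and new data on disk.
            mem.flush()
            return value + delta


def read_counter(path):
    """Read the counter back through the regular file interface."""
    with open(path, "rb") as f:
        return struct.unpack("<q", f.read(8))[0]


def demo():
    """Create a temporary counter file, update it twice, return the result."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, struct.pack("<q", 0))
        os.close(fd)
        update_counter(path, 5)
        update_counter(path, -2)
        return read_counter(path)
    finally:
        os.remove(path)
```

The appeal described in the abstract is that application code keeps this shape, plain in-memory updates followed by one sync call, while the strengthened primitive makes the sync all-or-nothing.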

Citations: 71
