Proceedings of the Eleventh European Conference on Computer Systems
TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901355
Alexey Tumanov, T. Zhu, J. Park, M. Kozuch, Mor Harchol-Balter, G. Ganger
TetriSched is a scheduler that works in tandem with a calendaring reservation system to continuously re-evaluate the immediate-term scheduling plan for all pending jobs (including those with reservations and best-effort jobs) on each scheduling cycle. TetriSched leverages information supplied by the reservation system about jobs' deadlines and estimated runtimes to plan ahead in deciding whether to wait for a busy preferred resource type (e.g., machine with a GPU) or fall back to less preferred placement options. Plan-ahead affords significant flexibility in handling mis-estimates in job runtimes specified at reservation time. Integrated with the main reservation system in Hadoop YARN, TetriSched is experimentally shown to achieve significantly higher SLO attainment and cluster utilization than the best-configured YARN reservation and CapacityScheduler stack deployed on a real 256 node cluster.
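The wait-versus-fallback trade-off can be made concrete with a toy decision rule. This is a sketch only: the function name, parameters, and the simple cost model are assumptions, whereas the actual TetriSched solves a global optimization over all pending jobs' space-time resource requests each cycle.

```python
def choose_placement(now, deadline, est_runtime, preferred_free_at, fallback_slowdown):
    """Toy plan-ahead choice for one job (hypothetical, not TetriSched's solver):
    wait for the busy preferred resource if the job can still meet its deadline
    after waiting; otherwise fall back to a slower placement if that meets it."""
    finish_if_wait = preferred_free_at + est_runtime
    finish_if_fallback = now + est_runtime * fallback_slowdown
    if finish_if_wait <= deadline:
        return "wait-for-preferred"
    if finish_if_fallback <= deadline:
        return "fallback"
    return "best-effort"  # deadline unattainable either way; run opportunistically
```

Plan-ahead appears here as the use of `preferred_free_at`: the scheduler reasons about when the preferred resource will free up, rather than only about what is free now, which is also what lets it absorb runtime mis-estimates from reservation time.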
Citations: 176
The Linux scheduler: a decade of wasted cores
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901326
Jean-Pierre Lozi, Baptiste Lepers, Justin R. Funston, Fabien Gaud, Vivien Quéma, Alexandra Fedorova
As a central part of resource management, the OS thread scheduler must maintain the following simple invariant: make sure that ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues. In our experiments, these performance bugs caused many-fold performance degradation for synchronization-heavy scientific applications, 13% higher latency for kernel make, and a 14-23% decrease in TPC-H throughput for a widely used commercial database. The main contribution of this work is the discovery and analysis of these bugs and the fixes we provide. Conventional testing techniques and debugging tools are ineffective at confirming or understanding this kind of bug, because its symptoms are often evasive. To drive our investigation, we built new tools that check for violations of the invariant online and visualize scheduling activity. They are simple, easily portable across kernel versions, and run with negligible overhead. We believe that making these tools part of the kernel developers' tool belt can help keep this type of bug at bay.
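The invariant lends itself to a simple online check of the kind the abstract describes. The following is a minimal sketch under assumed names, not the authors' kernel tool:

```python
def invariant_violated(queue_lengths, idle_cores):
    """Work-conservation check: the invariant is broken when at least one
    core sits idle while some other core's runqueue holds a ready thread.
    queue_lengths[i] = ready threads *waiting* (not running) on core i's
    runqueue; idle_cores = number of cores with nothing to run."""
    return idle_cores > 0 and any(q > 0 for q in queue_lengths)
```

A checker like this is cheap because it only inspects per-core counters; the hard part, per the paper, is that real violations are transient and evade conventional debugging.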
Citations: 146
An efficient design and implementation of LSM-tree based key-value store on open-channel SSD
Pub Date : 2014-04-14 DOI: 10.1145/2592798.2592804
Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, J. Cong
Various key-value (KV) stores are widely employed for data management to support Internet services as they offer higher efficiency, scalability, and availability than relational database systems. The log-structured merge tree (LSM-tree) based KV stores have attracted growing attention because they can eliminate random writes and maintain acceptable read performance. Recently, as the price per unit capacity of NAND flash decreases, solid state disks (SSDs) have been extensively adopted in enterprise-scale data centers to provide high I/O bandwidth and low access latency. However, it is inefficient to naively combine LSM-tree-based KV stores with SSDs, as the high parallelism enabled within the SSD cannot be fully exploited. Current LSM-tree-based KV stores are designed without assuming SSD's multi-channel architecture. To address this inadequacy, we propose LOCS, a system equipped with a customized SSD design, which exposes its internal flash channels to applications, to work with the LSM-tree-based KV store, specifically LevelDB in this work. We extend LevelDB to explicitly leverage the multiple channels of an SSD to exploit its abundant parallelism. In addition, we optimize scheduling and dispatching policies for concurrent I/O requests to further improve the efficiency of data access. Compared with the scenario where a stock LevelDB runs on a conventional SSD, the throughput of the storage system can be improved by more than 4X after applying all proposed optimization techniques.
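The benefit of exposing channels is that the host can balance requests itself. Below is a toy sketch of one plausible dispatch policy (least-loaded). The function and its parameters are assumptions for illustration; the paper evaluates several real scheduling and dispatching policies.

```python
def dispatch(request_sizes, n_channels):
    """Assign each I/O request (given by its size) to the currently
    least-loaded flash channel, returning per-request placements and the
    final per-channel load. A host-side policy like this is only possible
    because the open-channel SSD exposes its internal channels."""
    loads = [0] * n_channels
    placement = []
    for size in request_sizes:
        ch = loads.index(min(loads))  # pick the least-loaded channel
        placement.append(ch)
        loads[ch] += size
    return placement, loads
```

Round-robin is the simpler alternative, but a load-aware policy keeps channels evenly busy when request sizes are skewed, which is what lets the device's internal parallelism actually be exploited.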
Citations: 180
Kronos: the design and implementation of an event ordering service
Pub Date : 2014-04-14 DOI: 10.1145/2592798.2592822
Robert Escriva, Ayush Dubey, B. Wong, E. G. Sirer
This paper proposes a new approach to determining the order of interdependent operations in a distributed system. The key idea behind our approach is to factor the task of tracking happens-before relationships out of components that comprise the system, and to centralize them in a separate event ordering service. This not only simplifies implementation of individual components by freeing them from having to propagate dependence information, but also enables dependence relationships to be maintained across multiple independent systems. A novel API enables the system to detect and take advantage of concurrency whenever possible by maintaining fine-grained information and binding events to a time order as late as possible. We demonstrate the benefits of this approach through several example applications, including a transactional key-value store, and an online graph store. Experiments show that our event ordering service scales well and has low overhead in practice.
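The late-binding, concurrency-preserving behavior of such a service can be sketched as follows. The API and representation here are hypothetical; Kronos's actual interface differs, but the core idea is the same: events stay unordered until a happens-before constraint is asserted, and constraints that would contradict already-established order are rejected.

```python
class EventOrderer:
    """Toy centralized event-ordering service: events form a DAG of
    happens-before edges, and order is bound as late as possible."""
    def __init__(self):
        self.succ = {}  # event id -> set of events that must happen after it

    def create_event(self):
        e = len(self.succ)
        self.succ[e] = set()
        return e

    def _reaches(self, a, b):
        """True if b is (transitively) ordered after a."""
        stack, seen = [a], set()
        while stack:
            x = stack.pop()
            if x == b:
                return True
            if x in seen:
                continue
            seen.add(x)
            stack.extend(self.succ[x])
        return False

    def assign_order(self, a, b):
        """Constrain a happens-before b; reject if b already precedes a."""
        if self._reaches(b, a):
            return False  # would create a cycle, i.e. contradict prior order
        self.succ[a].add(b)
        return True
```

Because unconstrained event pairs stay unordered, independent components can proceed concurrently, and only genuinely dependent operations pay for ordering.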
Citations: 28
Snapshots in a flash with ioSnap
Pub Date : 2014-04-14 DOI: 10.1145/2592798.2592825
Sriram Subramanian, S. Sundararaman, Nisha Talagala, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Snapshots are a common and heavily relied upon feature in storage systems. The high performance of flash-based storage systems brings new, more stringent requirements for this classic capability. We present ioSnap, a flash-optimized snapshot system. Through careful design exploiting common snapshot usage patterns and flash-oriented optimizations, including leveraging native characteristics of Flash Translation Layers, ioSnap delivers low-overhead snapshots with minimal disruption to foreground traffic. Through our evaluation, we show that ioSnap incurs negligible performance overhead during normal operation, and that common-case operations such as snapshot creation and deletion incur little cost. We also demonstrate techniques to mitigate the performance impact on foreground I/O during intensive snapshot operations such as activation. Overall, ioSnap represents a case study of how to integrate snapshots into a modern, well-engineered flash-based storage system.
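The core trick of piggybacking snapshots on an FTL's native remapping can be illustrated with a toy copy-on-write device (names and structure are assumptions; ioSnap's integration with a real Flash Translation Layer is far more involved): since flash writes are already redirected out-of-place, freezing the logical-to-physical map captures a snapshot without copying any data.

```python
class COWDevice:
    """Toy log-structured device: writes append to a log (like flash pages)
    and update a logical-to-physical map; a snapshot just copies the map."""
    def __init__(self):
        self.log = []        # append-only data log
        self.map = {}        # logical block address -> log index
        self.snapshots = {}  # snapshot name -> frozen map

    def write(self, lba, data):
        self.log.append(data)            # out-of-place write
        self.map[lba] = len(self.log) - 1

    def snapshot(self, name):
        self.snapshots[name] = dict(self.map)  # freeze the mapping, not the data

    def read(self, lba, snap=None):
        m = self.snapshots[snap] if snap else self.map
        idx = m.get(lba)
        return None if idx is None else self.log[idx]
```

Snapshot creation here is O(map size) and touches no user data, which is why the common-case operations can be cheap; the expensive parts in a real system are garbage collection and snapshot activation, which the paper addresses separately.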
Citations: 28
Algorithmic improvements for fast concurrent Cuckoo hashing
Pub Date : 2014-04-14 DOI: 10.1145/2592798.2592820
Xiaozhou Li, D. Andersen, M. Kaminsky, M. Freedman
Fast concurrent hash tables are an increasingly important building block as we scale systems to greater numbers of cores and threads. This paper presents the design, implementation, and evaluation of a high-throughput and memory-efficient concurrent hash table that supports multiple readers and writers. The design arises from careful attention to systems-level optimizations such as minimizing critical section length and reducing interprocessor coherence traffic through algorithm re-engineering. As part of the architectural basis for this engineering, we include a discussion of our experience and results adopting Intel's recent hardware transactional memory (HTM) support to this critical building block. We find that naively allowing concurrent access using a coarse-grained lock on existing data structures reduces overall performance with more threads. While HTM mitigates this slowdown somewhat, it does not eliminate it. Algorithmic optimizations that benefit both HTM and designs for fine-grained locking are needed to achieve high performance. Our performance results demonstrate that our new hash table design---based around optimistic cuckoo hashing---outperforms other optimized concurrent hash tables by up to 2.5x for write-heavy workloads, even while using substantially less memory for small key-value items. On a 16-core machine, our hash table executes almost 40 million insert and more than 70 million lookup operations per second.
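The underlying cuckoo scheme — two candidate slots per key, with displacement on collision — can be sketched serially. The toy hash function and class names below are assumptions for illustration; the paper's contribution is the concurrent, HTM- and fine-grained-locking-friendly engineering layered on top of this basic scheme.

```python
class CuckooTable:
    """Serial cuckoo hash table: each key has two candidate slots; on
    collision, the resident item is displaced ("kicked") to its alternate."""
    def __init__(self, size=8, max_kicks=32):
        self.size, self.max_kicks = size, max_kicks
        self.slots = [None] * size

    def _h(self, key, seed):
        # toy deterministic hash, for illustration only
        return (sum(key.encode()) + 31 * seed) % self.size

    def insert(self, key, val):
        item = (key, val)
        for _ in range(self.max_kicks):
            for seed in (1, 2):
                i = self._h(item[0], seed)
                if self.slots[i] is None:
                    self.slots[i] = item
                    return True
            # both candidate slots occupied: evict from the first, re-place victim
            i = self._h(item[0], 1)
            self.slots[i], item = item, self.slots[i]
        return False  # displacement cycle: the table would need resizing

    def get(self, key):
        for seed in (1, 2):
            hit = self.slots[self._h(key, seed)]
            if hit is not None and hit[0] == key:
                return hit[1]
        return None
```

Reads are bounded (at most two probes), which is what makes the critical sections short; the eviction chain on insert is the part that needs care under concurrency, and it is where the paper's optimistic and HTM-based techniques apply.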
Citations: 161
Rex: replication at the speed of multi-core
Pub Date : 2014-04-14 DOI: 10.1145/2592798.2592800
Zhenyu Guo, C. Hong, Mao Yang, Dong Zhou, Lidong Zhou, Li Zhuang
Standard state-machine replication involves consensus on a sequence of totally ordered requests through, for example, the Paxos protocol. Such a sequential execution model is becoming outdated on prevalent multi-core servers. Highly concurrent executions on multi-core architectures introduce non-determinism related to thread scheduling and lock contention, and fundamentally break the assumption in state-machine replication. This tension between concurrency and consistency is not inherent, because the total ordering of requests is merely a simplifying convenience that is unnecessary for consistency. Concurrent executions of the application can instead be coupled to a sequence of consensus decisions on partial-order traces, rather than on totally ordered requests: the traces capture the non-deterministic decisions made in one replica's execution and are replayed, with the same decisions, on the others. The result is a new multi-core friendly replicated state-machine framework that achieves strong consistency while preserving parallelism in multi-threaded applications.
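The record-and-replay idea for one class of non-deterministic decisions, lock acquisition order, might look like the sketch below. This is greatly simplified and entirely hypothetical: Rex traces many kinds of decisions and agrees on the traces via consensus, none of which is shown here.

```python
import threading

class RecordingLock:
    """Toy record/replay lock: the primary records the order in which
    threads win the lock; a replica, given that trace, forces the same
    acquisition order so its concurrent execution reaches the same state."""
    def __init__(self, trace=None):
        self.lock = threading.Lock()
        self.record_mode = trace is None
        self.trace = [] if trace is None else list(trace)
        self.cv = threading.Condition()
        self.next_idx = 0

    def acquire(self, thread_id):
        if self.record_mode:
            self.lock.acquire()
            self.trace.append(thread_id)  # log who won the lock, in order
        else:
            with self.cv:  # block until it is this thread's recorded turn
                self.cv.wait_for(lambda: self.next_idx < len(self.trace)
                                 and self.trace[self.next_idx] == thread_id)
            self.lock.acquire()

    def release(self):
        self.lock.release()
        if not self.record_mode:
            with self.cv:
                self.next_idx += 1  # unblock the next thread in the trace
                self.cv.notify_all()
```

Only the *order* of lock wins is shipped between replicas, not the data the threads compute, which is why this approach can preserve intra-replica parallelism while still making replicas deterministic with respect to each other.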
Citations: 72
Aerie: flexible file-system interfaces to storage-class memory
Pub Date : 2014-04-14 DOI: 10.1145/2592798.2592810
Haris Volos, Sanketh Nalli, S. Panneerselvam, V. Varadarajan, Prashant Saxena, M. Swift
Storage-class memory technologies such as phase-change memory and memristors present a radically different interface to storage than existing block devices. As a result, they provide a unique opportunity to re-examine storage architectures. We find that the existing kernel-based stack of components, well suited for disks, unnecessarily limits the design and implementation of file systems for this new technology. We present Aerie, a flexible file-system architecture that exposes storage-class memory to user-mode programs so they can access files without kernel interaction. Aerie can implement a generic POSIX-like file system with performance similar to or better than a kernel implementation. The main benefit of Aerie, though, comes from enabling applications to optimize the file system interface. We demonstrate a specialized file system that reduces a hierarchical file system abstraction to a key/value store with fewer consistency guarantees but 20-109% higher performance than a kernel file system.
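The specialized file system's key idea — collapsing the hierarchical abstraction onto a key/value store — can be sketched as follows. The class and method names are hypothetical, and Aerie's user-mode implementation and consistency machinery are not represented; the sketch only shows the interface trade-off the abstract describes.

```python
class PathKVFS:
    """Toy flat file namespace over a key/value store: the full path is the
    key, so a file lookup is one KV get instead of a per-component
    directory walk, at the cost of weaker namespace consistency."""
    def __init__(self):
        self.kv = {}

    def put_file(self, path, data):
        self.kv[path] = data      # one KV put; no directory inodes to update

    def get_file(self, path):
        return self.kv.get(path)  # one KV get; no component-by-component walk

    def list_dir(self, prefix):
        # directory listing becomes a prefix scan: the accepted trade-off
        return sorted(p for p in self.kv if p.startswith(prefix.rstrip("/") + "/"))
```

This is the kind of interface optimization the paper argues user-mode file systems enable: an application that mostly opens files by full path pays nothing for hierarchy it does not use.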
Citations: 200
DynMR: dynamic MapReduce with ReduceTask interleaving and MapTask backfilling
Pub Date : 2014-04-14 DOI: 10.1145/2592798.2592805
Jian Tan, Alicia Chin, Z. Z. Hu, Yonggang Hu, S. Meng, Xiaoqiao Meng, Li Zhang
In order to improve the performance of MapReduce, we design DynMR. It addresses the following problems that persist in the existing implementations: 1) difficulty in selecting optimal performance parameters for a single job in a fixed, dedicated environment, and lack of capability to configure parameters that can perform optimally in a dynamic, multi-job cluster; 2) long job execution resulting from a task long-tail effect, often caused by ReduceTask data skew or heterogeneous computing nodes; 3) inefficient use of hardware resources, since ReduceTasks bundle several functional phases together and may idle during certain phases. DynMR adaptively interleaves the execution of several partially-completed ReduceTasks and backfills MapTasks so that they run in the same JVM, one at a time. It consists of three components. 1) A running ReduceTask uses a detection algorithm to identify resource underutilization during the shuffle phase. It then gives up the allocated hardware resources efficiently to the next task. 2) A number of ReduceTasks are gradually assembled in a progressive queue, according to a flow control algorithm in runtime. These tasks execute in an interleaved rotation. Additional ReduceTasks can be inserted adaptively to the progressive queue if the full fetching capacity is not reached. MapTasks can be back-filled therein if it is still underused. 3) Merge threads of each ReduceTask are extracted out as standalone services within the associated JVM. This design allows the data segments of multiple partially-complete ReduceTasks to reside in the same JVM heap, controlled by a segment manager and served by the common merge threads. Experiments show 10% ~ 40% improvements, depending on the workload.
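The interleaved rotation of partially-complete ReduceTasks in a single worker slot can be caricatured as follows. This is a toy model with assumed names, not DynMR's runtime flow control, which admits tasks adaptively based on fetch capacity and detects underutilization during the shuffle phase.

```python
def interleave(tasks, quantum):
    """Rotate one worker slot among partially-complete tasks: each task runs
    for `quantum` work units, then yields the slot to the next unfinished
    task, until all remaining work is drained.
    `tasks` maps task name -> remaining work units."""
    schedule = []
    remaining = dict(tasks)
    while remaining:
        for name in list(remaining):       # one rotation over live tasks
            schedule.append(name)
            remaining[name] -= quantum
            if remaining[name] <= 0:
                del remaining[name]        # task finished; drop it from rotation
    return schedule
```

The payoff of interleaving is that a slot blocked in one task's idle phase (e.g. waiting on shuffle data) can be handed to another ReduceTask, or backfilled with a MapTask, instead of sitting idle.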
{"title":"DynMR: dynamic MapReduce with ReduceTask interleaving and MapTask backfilling","authors":"Jian Tan, Alicia Chin, Z. Z. Hu, Yonggang Hu, S. Meng, Xiaoqiao Meng, Li Zhang","doi":"10.1145/2592798.2592805","DOIUrl":"https://doi.org/10.1145/2592798.2592805","url":null,"abstract":"In order to improve the performance of MapReduce, we design DynMR. It addresses the following problems that persist in the existing implementations: 1) difficulty in selecting optimal performance parameters for a single job in a fixed, dedicated environment, and lack of capability to configure parameters that can perform optimally in a dynamic, multi-job cluster; 2) long job execution resulting from a task long-tail effect, often caused by ReduceTask data skew or heterogeneous computing nodes; 3) inefficient use of hardware resources, since ReduceTasks bundle several functional phases together and may idle during certain phases. DynMR adaptively interleaves the execution of several partially-completed ReduceTasks and backfills MapTasks so that they run in the same JVM, one at a time. It consists of three components. 1) A running ReduceTask uses a detection algorithm to identify resource underutilization during the shuffle phase. It then gives up the allocated hardware resources efficiently to the next task. 2) A number of ReduceTasks are gradually assembled in a progressive queue, according to a flow control algorithm in runtime. These tasks execute in an interleaved rotation. Additional ReduceTasks can be inserted adaptively to the progressive queue if the full fetching capacity is not reached. MapTasks can be back-filled therein if it is still underused. 3) Merge threads of each ReduceTask are extracted out as standalone services within the associated JVM. This design allows the data segments of multiple partially-complete ReduceTasks to reside in the same JVM heap, controlled by a segment manager and served by the common merge threads. Experiments show 10% ~ 40% improvements, depending on the workload.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86830590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
DIBS: just-in-time congestion mitigation for data centers
Pub Date : 2014-04-14 DOI: 10.1145/2592798.2592806
K. Zarifis, Rui Miao, Matt Calder, Ethan Katz-Bassett, Minlan Yu, J. Padhye
Data centers must support a range of workloads with differing demands. Although existing approaches handle routine traffic smoothly, intense hotspots--even if ephemeral--cause excessive packet loss and severely degrade performance. This loss occurs even though congestion is typically highly localized, with spare buffer capacity at nearby switches. In this paper, we argue that switches should share buffer capacity to effectively handle this spot congestion without the monetary hit of deploying large buffers at individual switches. Specifically, we present detour-induced buffer sharing (DIBS), a mechanism that achieves a near lossless network without requiring additional buffers at individual switches. Using DIBS, a congested switch detours packets randomly to neighboring switches to avoid dropping the packets. We implement DIBS in hardware, on software routers in a testbed, and in simulation, and we demonstrate that it reduces the 99th percentile of delay-sensitive query completion time by up to 85%, with very little impact on other traffic.
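The core DIBS mechanism above, detouring a packet to a random neighbor instead of dropping it when the local buffer is full, can be sketched in a few lines of Python. This toy model ignores link delays, loops, and TTLs that a real switch implementation must handle; all class and field names here are illustrative:

```python
import random

class Switch:
    """Toy switch with a bounded output buffer that detours on overflow."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.buffer = []
        self.neighbors = []      # candidate detour targets
        self.dropped = 0
        self.detoured = 0

    def enqueue(self, pkt):
        if len(self.buffer) < self.capacity:
            self.buffer.append(pkt)
        elif self.neighbors:
            # DIBS: borrow a neighbor's spare buffer instead of dropping
            self.detoured += 1
            random.choice(self.neighbors).enqueue(pkt)
        else:
            self.dropped += 1

random.seed(0)
hot = Switch("hot", capacity=4)
spare = [Switch(f"n{i}", capacity=16) for i in range(3)]
hot.neighbors = spare

for pkt in range(20):            # a 20-packet burst hits the hot switch
    hot.enqueue(pkt)

print(hot.dropped)               # 0: every packet over capacity was detoured
print(hot.detoured)              # 16
```

With detouring disabled (`hot.neighbors = []`) the same burst would drop 16 packets; with it, the hotspot absorbs the burst using the localized spare capacity the paper observes at nearby switches.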
{"title":"DIBS: just-in-time congestion mitigation for data centers","authors":"K. Zarifis, Rui Miao, Matt Calder, Ethan Katz-Bassett, Minlan Yu, J. Padhye","doi":"10.1145/2592798.2592806","DOIUrl":"https://doi.org/10.1145/2592798.2592806","url":null,"abstract":"Data centers must support a range of workloads with differing demands. Although existing approaches handle routine traffic smoothly, intense hotspots--even if ephemeral--cause excessive packet loss and severely degrade performance. This loss occurs even though congestion is typically highly localized, with spare buffer capacity at nearby switches. In this paper, we argue that switches should share buffer capacity to effectively handle this spot congestion without the monetary hit of deploying large buffers at individual switches. Specifically, we present detour-induced buffer sharing (DIBS), a mechanism that achieves a near lossless network without requiring additional buffers at individual switches. Using DIBS, a congested switch detours packets randomly to neighboring switches to avoid dropping the packets. We implement DIBS in hardware, on software routers in a testbed, and in simulation, and we demonstrate that it reduces the 99th percentile of delay-sensitive query completion time by up to 85%, with very little impact on other traffic.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87155641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
Journal: Proceedings of the Eleventh European Conference on Computer Systems