
Proceedings of the Eleventh European Conference on Computer Systems: Latest Publications

Hardware read-write lock elision
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901346
P. Felber, S. Issa, A. Matveev, P. Romano
Hardware Lock Elision (HLE) is a promising technique for enhancing the parallelism of concurrent applications that rely on conventional, lock-based synchronization. The idea underlying current HLE approaches is to wrap critical sections into hardware transactions: this allows critical sections to be executed in parallel using a speculative approach, while leveraging the conflict detection capabilities of hardware transactions to ensure semantics equivalent to pessimistic lock-based synchronization. In this paper we present RW-LE, the first HLE approach targeting read-write locks. RW-LE introduces an innovative hardware-software co-design that exploits two recent micro-architectural features of POWER8 processors: suspending/resuming transaction execution and rollback-only transactions. RW-LE's design provides two major benefits over existing HLE techniques: i) it elides the read lock without resorting to hardware transactions, and ii) it avoids tracking read memory accesses issued in the write critical section. We evaluate RW-LE by means of an extensive experimental study based on a variety of benchmarks and real-life, complex applications. Our results demonstrate that RW-LE can provide a striking performance gain of up to one order of magnitude over state-of-the-art HLE approaches.
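To make the elision pattern the abstract builds on concrete, here is a minimal C sketch of classic lock elision: the critical section runs speculatively inside a hardware transaction and falls back to the real lock on abort. It uses x86 RTM intrinsics purely as a stand-in (the paper targets POWER8, whose suspend/resume and rollback-only facilities have no x86 equivalent), and it shows the baseline technique RW-LE improves on, not the RW-LE algorithm itself.

```c
/* Classic lock elision, sketched with x86 RTM intrinsics (a stand-in for
 * POWER8 HTM). Build with: gcc -O2 -mrtm -pthread elide.c */
#include <immintrin.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic int lock_held = 0;   /* mirrors lock state for validation */
static long shared_counter = 0;

static void critical_section(void)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Speculative path: subscribe to the lock state and abort if a
         * fallback holder is active, preserving lock semantics. */
        if (lock_held)
            _xabort(0xff);
        shared_counter++;           /* the actual critical section */
        _xend();
        return;
    }
    /* Fallback path: the transaction aborted, so take the real lock. */
    pthread_mutex_lock(&lock);
    lock_held = 1;
    shared_counter++;
    lock_held = 0;
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++)
        critical_section();
    return arg;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);
    return 0;
}
```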
Citations: 21
Data tiering in heterogeneous memory systems
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901344
Subramanya R. Dulloor, Amitabha Roy, Zheguang Zhao, N. Sundaram, N. Satish, R. Sankaran, Jeffrey R. Jackson, K. Schwan
Memory-based data center applications require increasingly large memory capacities, but face the challenges posed by the inherent difficulty of scaling DRAM and by DRAM's cost. Future systems are attempting to address these demands with heterogeneous memory architectures that couple DRAM with high-capacity, low-cost, but lower-performance non-volatile memories (NVM) such as PCM and RRAM. A key usage model intended for NVM is as cheaper, high-capacity volatile memory. Data center operators are bound to ask whether this model, in which NVM replaces the majority of DRAM, leads to a large slowdown in their applications. It is crucial to answer this question, because a large performance impact would be an impediment to the adoption of such systems. This paper presents a thorough study of representative applications -- including a key-value store (MemC3), an in-memory database (VoltDB), and a graph analytics framework (GraphMat) -- on a platform that is capable of emulating a mix of memory technologies. Our conclusion is that it is indeed possible to use a mix of a small amount of fast DRAM and large amounts of slower NVM without a proportional impact on an application's performance. The caveat is that this result can only be achieved through careful placement of data structures. The contribution of this paper is the design and implementation of a set of libraries and automatic tools that enable programmers to achieve optimal data placement with minimal effort. With such guided placement, and with DRAM constituting only 6% of the total memory footprint for GraphMat and 25% for VoltDB and MemC3 (the remaining memory is NVM with 4x higher latency and 8x lower bandwidth than DRAM), our target applications show only a 13% to 40% slowdown. Without guided placement, these applications see, in the worst case, a 1.5x to 5.9x slowdown on the same configuration. Based on the realistic assumption that NVM will be 5x cheaper (per bit) than DRAM, this hybrid solution also delivers 2x to 2.8x better performance/$ than a DRAM-only system.
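The guided-placement idea lends itself to a short illustration. The sketch below assumes a platform that exposes NVM as a slower NUMA node (node 1 here, an assumption about the test machine, not part of the paper's toolchain) and uses stock libnuma calls to pin a hot index in DRAM and a large cold payload in NVM; the paper's libraries automate exactly this kind of decision.

```c
/* Manual DRAM/NVM data tiering via libnuma, assuming NVM appears as
 * NUMA node 1. Link with -lnuma. */
#include <numa.h>
#include <stdio.h>

#define DRAM_NODE 0            /* assumed: node backed by local DRAM */
#define NVM_NODE  1            /* assumed: node backed by slower NVM */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t hot_sz  = 1UL << 20;   /* 1 MiB latency-critical index */
    size_t cold_sz = 1UL << 28;   /* 256 MiB scan-mostly payload */

    /* Hot, frequently dereferenced structure: keep in fast DRAM. */
    long *hot_index = numa_alloc_onnode(hot_sz, DRAM_NODE);
    /* Large, sequentially scanned data: tolerate NVM's higher latency. */
    char *cold_data = numa_alloc_onnode(cold_sz, NVM_NODE);
    if (!hot_index || !cold_data)
        return 1;

    hot_index[0] = 42;
    cold_data[0] = 'x';

    numa_free(hot_index, hot_sz);
    numa_free(cold_data, cold_sz);
    return 0;
}
```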
Citations: 191
Increasing large-scale data center capacity by statistical power control
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901338
Guosai Wang, Shuhao Wang, Bing Luo, Weisong Shi, Yinghan Zhu, Wenjun Yang, Dianming Hu, Longbo Huang, Xin Jin, W. Xu
Given the high cost of large-scale data centers, an important design goal is to fully utilize available power resources to maximize computing capacity. In this paper we present Ampere, a novel power management system that increases the computing capacity of data centers by over-provisioning the number of servers. Instead of power capping, which degrades the performance of running jobs, we use a statistical control approach that implements dynamic power management by indirectly influencing workload scheduling, which greatly reduces the risk of power violations. Instead of becoming part of the already over-complicated scheduler, Ampere interacts with the scheduler through only two basic APIs. Instead of controlling power at the rack level, we impose the power constraint at the row level, which leaves more room for over-provisioning. We have implemented and deployed Ampere in our production data center. Controlled experiments on 400+ servers show that by adding 17% more servers, we can increase the throughput of the data center by 15%, leading to significant cost savings while causing no disturbance to job performance.
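The abstract does not name the two scheduler-facing APIs, so the C sketch below invents a minimal pair for illustration: an admission check against a row-level power budget, and a report call that keeps the controller's statistical estimate current. The names, the budget, and the 5% safety margin are all assumptions, not Ampere's published interface.

```c
/* Hypothetical two-call interface between a row-level power controller
 * and a cluster scheduler, in the spirit of the abstract. */
#include <stdbool.h>

#define ROW_POWER_CAP_W 100000.0      /* assumed row-level power budget */
#define SAFETY_MARGIN   0.95          /* assumed statistical safety factor */

static double row_power_estimate_w;   /* controller's running estimate */

/* API 1: may the scheduler start a job expected to draw this much power
 * on this row? The check is statistical headroom, not a hard cap. */
bool power_admit(double projected_watts)
{
    return row_power_estimate_w + projected_watts
           < SAFETY_MARGIN * ROW_POWER_CAP_W;
}

/* API 2: the scheduler reports job starts (+watts) and completions
 * (-watts) so the controller's estimate tracks the real load. */
void power_report(double delta_watts)
{
    row_power_estimate_w += delta_watts;
}
```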
Citations: 24
Efficient queue management for cluster scheduling
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901354
Jeff Rasley, Konstantinos Karanasos, Srikanth Kandula, Rodrigo Fonseca, M. Vojnović, Sriram Rao
Job scheduling in Big Data clusters is crucial both for cluster operators' return on investment and for overall user experience. In this context, we observe several anomalies in how modern cluster schedulers manage queues, and argue that maintaining queues of tasks at worker nodes has significant benefits. On one hand, centralized approaches do not use worker-side queues. Given the inherent feedback delays that these systems incur, they achieve suboptimal cluster utilization, particularly for workloads dominated by short tasks. On the other hand, distributed schedulers typically do employ worker-side queuing, and achieve higher cluster utilization. However, they fail to place tasks at the best possible machine, since they lack cluster-wide information, leading to worse job completion time, especially for heterogeneous workloads. To the best of our knowledge, this is the first work to provide principled solutions to the above problems by introducing queue management techniques, such as appropriate queue sizing, prioritization of task execution via queue reordering, starvation freedom, and careful placement of tasks to queues. We instantiate our techniques by extending both a centralized (YARN) and a distributed (Mercury) scheduler, and evaluate their performance on a wide variety of synthetic and production workloads derived from Microsoft clusters. Our centralized implementation, Yaq-c, achieves 1.7x improvement on median job completion time compared to YARN, and our distributed one, Yaq-d, achieves 9.3x improvement over an implementation of Sparrow's batch sampling on Mercury.
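As a toy illustration of queue reordering with starvation freedom, the sketch below orders a worker-side queue by estimated runtime (favoring short tasks, which drives the utilization wins described above) while crediting waiting time, so long-waiting tasks eventually rise to the front. The aging weight is an invented constant, not a policy from the paper.

```c
/* Worker-side queue reordering: shortest-estimated-job-first with an
 * aging term that guarantees starvation freedom. */
#include <stdlib.h>
#include <time.h>

struct task {
    double est_runtime_s;   /* scheduler's runtime estimate */
    time_t  enqueue_time;   /* when the task entered the queue */
};

static double priority(const struct task *t, time_t now)
{
    double wait = difftime(now, t->enqueue_time);
    /* Lower score = runs earlier. Waiting time offsets the runtime
     * estimate, so every task's score eventually wins. The 0.1 aging
     * weight is an illustrative assumption. */
    return t->est_runtime_s - 0.1 * wait;
}

static time_t g_now;   /* snapshot so the sort comparator is consistent */

static int cmp(const void *a, const void *b)
{
    double pa = priority(a, g_now), pb = priority(b, g_now);
    return (pa > pb) - (pa < pb);
}

void reorder_queue(struct task *q, size_t n)
{
    g_now = time(NULL);
    qsort(q, n, sizeof *q, cmp);
}
```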
Citations: 104
Crayon: saving power through shape and color approximation on next-generation displays
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901347
Phillip Stanley-Marbell, V. Estellers, M. Rinard
We present Crayon, a library and runtime system that reduces display power dissipation by acceptably approximating displayed images via shape and color transforms. Crayon can be inserted between an application and the display to optimize dynamically generated images before they appear on the screen. It can also be applied offline to optimize stored images before they are retrieved and displayed. Crayon exploits three fundamental properties: the acceptability of small changes in shape and color, the fact that the power dissipation of OLED displays and DLP pico-projectors is different for different colors, and the relatively small energy cost of computation in comparison to display energy usage. We implement and evaluate Crayon in three contexts: a hardware platform with detailed power measurement facilities and an OLED display, an Android tablet, and a set of cross-platform tools. Our results show that Crayon's color transforms can reduce display power dissipation by over 66% while producing images that remain visually acceptable to users. The measured whole-system power reduction is approximately 50%. We quantify the acceptability of Crayon's shape and color transforms with a user study involving over 400 participants and over 21,000 image evaluations.
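A sketch of the kind of color transform involved: on an OLED panel each channel has a different power cost per unit intensity, so scaling the costly channels slightly trades a small, bounded color change for display power. The per-channel weights and the scaling bound below are illustrative assumptions, not Crayon's measured model or its actual transform.

```c
/* Power-aware per-channel scaling of an RGB framebuffer, in the spirit
 * of OLED display-power approximation. */
#include <stdint.h>
#include <stddef.h>

/* Assumed relative power cost per unit intensity for R, G, B (blue OLED
 * subpixels are typically the least efficient). */
static const double COST[3] = { 0.8, 0.6, 1.0 };

void approximate_frame(uint8_t *rgb, size_t npixels, double max_scale_down)
{
    for (size_t i = 0; i < npixels; i++) {
        for (int c = 0; c < 3; c++) {
            /* Scale each channel down in proportion to its power cost,
             * bounded by max_scale_down (e.g. 0.1) so the change stays
             * visually acceptable. */
            double scale = 1.0 - max_scale_down * COST[c];
            rgb[3 * i + c] = (uint8_t)(rgb[3 * i + c] * scale);
        }
    }
}
```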
Citations: 23
pVM: persistent virtual memory for efficient capacity scaling and object storage
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901325
Sudarsun Kannan, Ada Gavrilovska, K. Schwan
Next-generation byte-addressable nonvolatile memories (NVMs), such as phase change memory (PCM) and Memristors, promise fast data storage, and more importantly, address DRAM scalability issues. State-of-the-art OS mechanisms for NVMs have focused on improving the block-based virtual file system (VFS) to manage both persistence and the memory capacity scaling needs of applications. However, using the VFS for capacity scaling has several limitations, such as the lack of automatic memory capacity scaling across DRAM and NVM, inefficient use of the processor cache and TLB, and high page access costs. These limitations reduce application performance and also impact applications that use NVM for persistent object storage with flat namespaces, such as photo stores, NoSQL databases, and others. To address such limitations, we propose persistent virtual memory (pVM), a system software abstraction that provides applications with (1) automatic OS-level memory capacity scaling, (2) flexible memory placement policies across NVM, and (3) fast object storage. pVM extends the OS virtual memory (VM) instead of building on the VFS and abstracts NVM as a NUMA node with support for NVM-based memory placement mechanisms. pVM inherits benefits from the cache and TLB-efficient VM subsystem and augments these further by distinguishing between persistent and nonpersistent capacity use of NVM. Additionally, pVM achieves fast persistent storage by further extending the VM subsystem with consistent and durable OS-level persistent metadata. Our evaluation of pVM with memory capacity-intensive applications shows a 2.5x speedup and up to 80% lower TLB and cache misses compared to VFS-based systems. pVM's object store provides 2x higher throughput compared to the block-based approach of the state-of-the art solution and up to a 4x reduction in the time spent in the OS.
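pVM's object store is an OS-level facility, but the usage model it targets, flat-namespace persistent objects accessed through ordinary loads and stores, can be sketched with plain POSIX calls: map a file (or DAX-backed region), write objects into the mapping, and flush before acknowledging. This is a minimal stand-in under those assumptions, not pVM's interface.

```c
/* Flat-namespace persistent object storage over a memory-mapped file. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (1 << 20)

int main(void)
{
    int fd = open("objstore.img", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, REGION_SIZE) < 0)
        return 1;

    /* Map the region: loads and stores below go to persistent media. */
    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED)
        return 1;

    /* "Store" an object at a fixed offset in the flat namespace... */
    strcpy(region + 128, "persistent object payload");
    /* ...and make it durable before acknowledging the write. */
    msync(region, REGION_SIZE, MS_SYNC);

    munmap(region, REGION_SIZE);
    close(fd);
    return 0;
}
```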
Citations: 68
Oasis: energy proportionality with hybrid server consolidation
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901333
Junji Zhi, Nilton Bila, E. D. Lara
Cloud data centers operate at very low utilization rates, resulting in significant energy waste. Oasis is a new approach for energy-oriented cluster management that enables dense server consolidation. Oasis achieves high consolidation ratios by combining traditional full VM migration with partial VM migration. Partial VM migration densely consolidates the working sets of idle VMs by migrating to a consolidation host, on demand, only the pages that the idle VMs actually access. Full VM migration dynamically adapts the placement of VMs so that hosts are freed from active VMs. Oasis sizes the cluster and saves energy by placing hosts without active VMs into sleep mode, using a low-power memory server design that allows the sleeping hosts to continue to service memory requests. In a simulated VDI server farm, our prototype reduces energy use by up to 28% on weekdays and 43% on weekends with minimal impact on user productivity.
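The host-sleeping decision at the core of this consolidation can be sketched simply: a host becomes a sleep candidate once none of its VMs is active, with its idle VMs' working sets then served on demand by the memory server. The structures and 64-VM bound below are illustrative, not Oasis's implementation.

```c
/* Consolidation sweep: count hosts whose VMs are all idle and which may
 * therefore be put to sleep. */
#include <stdbool.h>
#include <stddef.h>

struct host {
    int  n_vms;
    bool vm_active[64];   /* per-VM activity flags from the monitor */
};

/* A host may enter sleep mode only if none of its VMs is active; its
 * idle VMs' pages are then served by the low-power memory server. */
bool can_sleep(const struct host *h)
{
    for (int i = 0; i < h->n_vms; i++)
        if (h->vm_active[i])
            return false;
    return true;
}

size_t count_sleepable(const struct host *hosts, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        k += can_sleep(&hosts[i]);
    return k;
}
```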
Citations: 9
Flint: batch-interactive data-intensive processing on transient servers
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901319
Prateek Sharma, Tian Guo, Xin He, David E. Irwin, P. Shenoy
Cloud providers now offer transient servers, which they may revoke at any time, for significantly lower prices than on-demand servers, which they cannot revoke. The low price of transient servers is particularly attractive for executing an emerging class of workloads, which we call Batch-Interactive Data-Intensive (BIDI), that is becoming increasingly important for data analytics. BIDI workloads require large sets of servers to cache massive datasets in memory to enable low-latency operation. In this paper, we illustrate the challenges of executing BIDI workloads on transient servers, where revocations (akin to failures) are the common case. To address these challenges, we design Flint, which is based on Spark and includes automated checkpointing and server selection policies that i) support batch and interactive applications and ii) dynamically adapt to application characteristics. We evaluate a prototype of Flint using EC2 spot instances, and show that it yields cost savings of up to 90% compared to using on-demand servers, while increasing running time by < 2%.
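Flint's checkpointing policy adapts to application characteristics; as a rough stand-in, the classic Young approximation below computes the checkpoint interval that balances checkpoint cost against expected recomputation after a revocation. The formula is a standard fault-tolerance result, offered for intuition rather than as Flint's actual policy, and the example numbers are invented.

```c
/* Young's approximation for the checkpoint interval: T ~ sqrt(2 * C * M),
 * where C is the cost of one checkpoint and M the mean time between
 * revocations. Link with -lm. */
#include <math.h>
#include <stdio.h>

double checkpoint_interval_s(double ckpt_cost_s, double mean_revocation_s)
{
    return sqrt(2.0 * ckpt_cost_s * mean_revocation_s);
}

int main(void)
{
    /* e.g., 30 s to checkpoint cached data, revocations every 2 hours:
     * yields roughly an 11-minute interval. */
    printf("interval: %.0f s\n", checkpoint_interval_s(30.0, 7200.0));
    return 0;
}
```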
Citations: 88
Parallel sections: scaling system-level data-structures
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901356
Qi Wang, Tim Stamler, Gabriel Parmer
As systems continue to increase the number of cores within cache coherency domains, traditional techniques for enabling parallel computation on data-structures are increasingly strained. A single contended cache line bouncing between different caches can prohibit continued performance gains from additional cores. New abstractions and mechanisms are required to reassess how data-structure consistency can be provided while maintaining stable per-core access latencies. This paper presents the Parallel Sections (ParSec) abstraction for mediating access to shared data-structures. Fundamental to the approach is a new form of scalable memory reclamation that leverages fast local access to the real-time clock to globally order system events. This approach attempts to minimize coherency traffic while harnessing the benefit of shared, read-mostly cache lines. We show that the co-management of scalable memory reclamation, memory allocation, locking, and namespace management enables scalable system service implementation. We apply ParSec to both memcached and virtual memory management in a microkernel, and find order-of-magnitude performance increases on a four-socket, 40-core machine, and 30x lower 99th-percentile latencies for virtual memory management.
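The time-based reclamation idea can be sketched as follows: each core publishes a timestamp when it enters a parallel section, and a logically removed object may be physically freed once every core's published time postdates the removal, so no core can still hold a pre-removal reference. This is a simplified illustration; a real scheme must also handle cores outside any section, batching, and memory-ordering subtleties.

```c
/* Time-based quiescence check using per-core section-entry timestamps. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NCORES 40

static _Atomic uint64_t section_entry_ns[NCORES];

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Called by a core when it enters a parallel section: cheap, local,
 * and globally ordered via the shared clock. */
void parsec_enter(int core)
{
    atomic_store(&section_entry_ns[core], now_ns());
}

/* May an object logically removed at remove_ns be physically freed?
 * Yes, once every core has entered a section after the removal, so no
 * core can still hold a reference obtained before it. */
bool parsec_quiesced(uint64_t remove_ns)
{
    for (int c = 0; c < NCORES; c++)
        if (atomic_load(&section_entry_ns[c]) <= remove_ns)
            return false;
    return true;
}
```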
Citations: 16
BB: booting booster for consumer electronics with modern OS
Pub Date: 2016-04-18 | DOI: 10.1145/2901318.2901320
Geunsik Lim, MyungJoo Ham
Unconventional computing platforms have spread widely and rapidly following smart phones and tablets: consumer electronics such as smart TVs and digital cameras. For such devices, fast booting is a critical requirement; unlike with a PC or smart phone, waiting tens of seconds for a TV or a camera to boot up is not acceptable. Moreover, the software platforms of these devices have become as rich as those of conventional computing devices in order to provide comparable services. As a result, the booting procedure that starts every required OS service, hardware component, and application, the quantity of which is ever increasing, may take an unbearably long time for most consumers. To accelerate booting, this paper introduces Booting Booster (BB), which is used in all 2015 Samsung Smart TV models running the Linux-based Tizen OS. BB addresses the init scheme of Linux, which launches initial user-space OS services and applications and manages the life cycles of all user processes, by identifying and isolating booting-critical tasks, deferring non-critical tasks, and enabling more tasks to execute in parallel. BB has been successfully deployed in Samsung Smart TV 2015 models, achieving a cold boot in 3.5 s (compared to 8.1 s with full commercial-grade optimizations but without BB) without the need for suspend-to-RAM or hibernation. After this successful deployment, we have released the source code via http://opensource.samsung.com/, and BB will be included in the open-source OS, Tizen (http://tizen.org/).
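The booting-critical / deferred split can be illustrated with a small two-phase launcher: start the tasks the first usable screen depends on in parallel, declare boot complete, then bring up everything else in the background. The task names are invented placeholders; BB itself is integrated into Tizen's init system, not a standalone program like this.

```c
/* Two-phase boot sketch: parallel boot-critical tasks, then deferred
 * non-critical ones. Build with: gcc -pthread boot.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *run_task(void *name)
{
    printf("starting %s\n", (const char *)name);
    sleep(1);   /* stand-in for a real service's startup work */
    return NULL;
}

int main(void)
{
    const char *critical[] = { "display", "tv-tuner", "input" };
    const char *deferred[] = { "app-store", "voice", "updater" };
    pthread_t t[3];

    /* Phase 1: boot-critical tasks, launched maximally in parallel. */
    for (int i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, run_task, (void *)critical[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    printf("boot complete: UI is usable\n");

    /* Phase 2: non-critical tasks, deferred past first use. */
    for (int i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, run_task, (void *)deferred[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```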
Citations: 2