
Proceedings of the Eleventh European Conference on Computer Systems: Latest Publications

Shared address translation revisited
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901327
Xiaowan Dong, S. Dwarkadas, A. Cox
Modern operating systems avoid duplicating code and data that are mapped by multiple processes by sharing physical memory through mechanisms like copy-on-write. Nonetheless, a separate copy of the virtual address translation structures, such as page tables, is still maintained for each process, even when those copies are identical. This duplication can lead to inefficiencies in the address translation process and interference within the memory hierarchy. In this paper, we show that on Android platforms, sharing address translation structures, specifically page tables and TLB entries, for shared libraries can improve performance. For example, at a low level, sharing address translation structures reduces the cost of fork by more than half by reducing page table construction overheads. At a higher level, application launch and IPC are faster due to page fault elimination coupled with better cache and TLB performance when context switching.
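To make the fork example above concrete, the following minimal Python sketch (hypothetical names; not the authors' kernel implementation) contrasts a fork that copies every translation entry with one that merely takes a reference on a shared page-table subtree for a read-only shared-library mapping.

```python
# Illustrative model only: real page tables live in the kernel and are
# architecture-specific; SharedSubtree and Process are hypothetical names.

class SharedSubtree:
    """Page-table pages for a read-only shared-library mapping."""
    def __init__(self, num_entries):
        self.entries = list(range(num_entries))  # stand-in for PTEs
        self.refcount = 0

class Process:
    def __init__(self, private_entries, shared_subtree=None):
        # Private mappings are always copied per process.
        self.private = list(private_entries)
        self.shared = shared_subtree
        if shared_subtree is not None:
            shared_subtree.refcount += 1

def fork_without_sharing(parent):
    # Conventional behaviour: rebuild every translation, shared or not.
    copied = list(parent.private) + (list(parent.shared.entries) if parent.shared else [])
    return Process(copied), len(copied)           # cost ~ all entries

def fork_with_sharing(parent):
    # Shared-translation behaviour: copy private entries, reference the rest.
    return Process(parent.private, parent.shared), len(parent.private)

libc = SharedSubtree(num_entries=512)
parent = Process(private_entries=range(64), shared_subtree=libc)
_, cost_copy = fork_without_sharing(parent)
_, cost_share = fork_with_sharing(parent)
print(cost_copy, cost_share)   # 576 vs 64 entries touched at fork
```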
Citations: 18
A high performance file system for non-volatile main memory
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901324
Jiaxin Ou, J. Shu, Youyou Lu
Emerging non-volatile main memories (NVMMs) provide data persistence at the main memory level. To avoid the double-copy overheads among the user buffer, the OS page cache, and the storage layer, state-of-the-art NVMM-aware file systems bypass the OS page cache and copy data directly between the user buffer and the NVMM storage. However, one major drawback of existing NVMM technologies is their slow writes. As a result, such direct access for all file operations can lead to suboptimal system performance. In this paper, we propose HiNFS, a high performance file system for non-volatile main memory. Specifically, HiNFS uses an NVMM-aware Write Buffer policy to buffer lazy-persistent file writes in DRAM and persists them to NVMM lazily to hide the long write latency of NVMM. However, HiNFS performs direct access to NVMM for eager-persistent file writes, and directly reads file data from both DRAM and NVMM as they have similar read performance, in order to eliminate the double-copy overheads from the critical path. To ensure read consistency, HiNFS uses a combination of the DRAM Block Index and Cacheline Bitmap to track the latest data between DRAM and NVMM. Finally, HiNFS employs a Buffer Benefit Model to identify the eager-persistent file writes before issuing the write operations. Using software NVMM emulators, we evaluate HiNFS's performance with various workloads. Compared with state-of-the-art NVMM-aware file systems, PMFS and EXT4-DAX, our results surprisingly show that HiNFS improves the system throughput by up to 184% for filebench microbenchmarks and reduces the execution time by up to 64% for data-intensive traces and macro-benchmarks, demonstrating the benefits of hiding the long write latency of NVMM.
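The write-buffering policy above can be illustrated with a small Python sketch. The class name and the dict-based stores are assumptions for illustration; the real HiNFS operates on NVMM pages inside the kernel.

```python
# Toy model of the write policy described in the abstract: lazy-persistent
# writes are buffered in DRAM, eager-persistent writes go straight to "NVMM",
# and reads are served from whichever copy is newest.

class ToyWriteBuffer:
    def __init__(self):
        self.nvmm = {}          # persistent store (slow writes in reality)
        self.dram_buffer = {}   # write buffer for lazy-persistent writes

    def write(self, block, data, eager_persistent):
        if eager_persistent:
            # e.g. persistence-critical writes: direct access to NVMM
            self.nvmm[block] = data
            self.dram_buffer.pop(block, None)
        else:
            # lazy-persistent writes hide NVMM's long write latency in DRAM
            self.dram_buffer[block] = data

    def read(self, block):
        # a block index would track which copy is newest; here the buffer wins
        if block in self.dram_buffer:
            return self.dram_buffer[block]
        return self.nvmm.get(block)

    def flush(self):
        # background lazy persistence to NVMM
        self.nvmm.update(self.dram_buffer)
        self.dram_buffer.clear()

fs = ToyWriteBuffer()
fs.write(0, b"hot data", eager_persistent=False)
fs.write(1, b"must persist now", eager_persistent=True)
assert fs.read(0) == b"hot data"
fs.flush()
```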
Citations: 132
A study of modern Linux API usage and compatibility: what to support when you're supporting
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901341
Chia-che Tsai, Bhushan Jain, N. A. Abdul, Donald E. Porter
This paper presents a study of Linux API usage across all applications and libraries in the Ubuntu Linux 15.04 distribution. We propose metrics for reasoning about the importance of various system APIs, including system calls, pseudo-files, and libc functions. Our metrics are designed for evaluating the relative maturity of a prototype system or compatibility layer, and this paper focuses on compatibility with Linux applications. This study uses a combination of static analysis to understand API usage and survey data to weight the relative importance of applications to end users. This paper yields several insights for developers and researchers, which are useful for assessing the complexity and security of Linux APIs. For example, every Ubuntu installation requires 224 system calls, 208 ioctl, fcntl, and prctl codes, and hundreds of pseudo-files. For each API type, a significant number of APIs are rarely used, if ever. Moreover, several security-relevant API changes, such as replacing access with faccessat, have met with slow adoption. Finally, hundreds of libc interfaces are effectively unused, yielding opportunities to improve security and efficiency by restructuring libc.
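One plausible way to combine static API-usage data with installation-popularity weights is sketched below. The formula and numbers are illustrative assumptions, not the paper's exact metric definitions.

```python
# Sketch: weight an API by the chance that at least one application a user
# installs actually needs it. Apps and shares below are made up.

def api_importance(api_users, install_share):
    """Probability that at least one installed application needs the API.
    api_users: set of app names that call the API
    install_share: app name -> fraction of installations including that app
    """
    p_not_needed = 1.0
    for app, share in install_share.items():
        if app in api_users:
            p_not_needed *= (1.0 - share)
    return 1.0 - p_not_needed

install_share = {"bash": 0.99, "apache2": 0.30, "obscure-tool": 0.01}
print(api_importance({"bash", "apache2"}, install_share))   # ~0.993
print(api_importance({"obscure-tool"}, install_share))      # ~0.01
```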
Citations: 62
Partial-parallel-repair (PPR): a distributed technique for repairing erasure coded storage
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901328
S. Mitra, R. Panta, Moo-Ryong Ra, S. Bagchi
With the explosion of data in applications all around us, erasure coded storage has emerged as an attractive alternative to replication because, even with significantly lower storage overhead, it provides better reliability against data loss. Reed-Solomon code is the most widely used erasure code because it provides maximum reliability for a given storage overhead and is flexible in the choice of coding parameters that determine the achievable reliability. However, reconstruction time for unavailable data becomes prohibitively long, mainly because of network bottlenecks. Some proposed solutions either use additional storage or limit the coding parameters that can be used. In this paper, we propose a novel distributed reconstruction technique, called Partial Parallel Repair (PPR), which divides the reconstruction operation into small partial operations and schedules them on multiple nodes already involved in the data reconstruction. A distributed protocol then progressively combines these partial results to reconstruct the unavailable data blocks, which reduces the network pressure. Theoretically, our technique can complete the network transfer in ⌈log2(k + 1)⌉ time, compared to the k time needed for a (k, m) Reed-Solomon code. Our experiments show that PPR reduces repair time and degraded read time significantly. Moreover, our technique is compatible with existing erasure codes and does not require any additional storage overhead. We demonstrate this by overlaying PPR on top of two prior schemes, Local Reconstruction Code and Rotated Reed-Solomon code, to gain additional savings in reconstruction time.
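The ⌈log2(k + 1)⌉ bound can be illustrated with a toy combining tree: in each round, disjoint pairs of nodes merge their partial results in parallel, and the replacement node joins the tree as one more participant. XOR stands in for the Reed-Solomon field arithmetic, and the node count and data are made up.

```python
# Illustrative only: XOR stands in for Reed-Solomon field arithmetic, and
# each round models one transfer time on the bottleneck link.
import math

def ppr_combine(partial_terms):
    """Pairwise-combine partial repair terms; returns (result, rounds used)."""
    terms = list(partial_terms)
    rounds = 0
    while len(terms) > 1:
        nxt = []
        # in each round, disjoint pairs of nodes combine in parallel
        for i in range(0, len(terms) - 1, 2):
            nxt.append(terms[i] ^ terms[i + 1])
        if len(terms) % 2:
            nxt.append(terms[-1])   # the odd node out waits for the next round
        terms = nxt
        rounds += 1
    return terms[0], rounds

k = 6
partials = [0x3A ^ i for i in range(k)]   # pretend c_i * a_i repair terms
# the replacement node joins the tree holding the XOR identity (0), so
# k + 1 nodes participate and the result ends up at the newcomer
repaired, rounds = ppr_combine(partials + [0])
print(rounds, math.ceil(math.log2(k + 1)))   # 3 vs. 3
```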
Citations: 105
Fast and general distributed transactions using RDMA and HTM
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901349
Yanzhe Chen, Xingda Wei, Jiaxin Shi, Rong Chen, Haibo Chen
Recent transaction processing systems attempt to leverage advanced hardware features like RDMA and HTM to significantly boost performance. This, however, poses several limitations, such as requiring a priori knowledge of the read/write sets of transactions and providing no availability support. In this paper, we present DrTM+R, a fast in-memory transaction processing system that retains the performance benefit of advanced hardware features, while supporting general transactional workloads and high availability through replication. DrTM+R addresses the generality issue by designing a hybrid OCC and locking scheme, which leverages the strong atomicity of HTM and the strong consistency of RDMA to preserve strict serializability with high performance. To resolve the race condition between the immediate visibility of records updated by HTM transactions and the unready replication of such records, DrTM+R leverages an optimistic replication scheme that uses seqlock-like versioning to distinguish the visibility of tuples from the readiness of record replication. Evaluation using typical OLTP workloads like TPC-C and SmallBank shows that DrTM+R scales well on a 6-node cluster and achieves over 5.69 and 94 million transactions per second without replication for TPC-C and SmallBank respectively. Enabling 3-way replication on DrTM+R incurs at most 41% overhead before reaching the network bottleneck, and is still an order of magnitude faster than a state-of-the-art distributed transaction system (Calvin).
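The seqlock-like versioning mentioned above can be sketched as follows. This is a generic, single-machine seqlock in Python, not DrTM+R's HTM/RDMA implementation; it only shows how an odd/even version counter separates in-progress updates from visible ones.

```python
# Toy seqlock: writers bump the version to odd before updating and to even
# after; readers retry if the version is odd or changed while they read.
# DrTM+R's actual scheme additionally encodes replication readiness.
import threading

class SeqlockRecord:
    def __init__(self, value):
        self.version = 0
        self.value = value
        self._writer = threading.Lock()

    def write(self, value):
        with self._writer:
            self.version += 1      # odd: update in progress, not yet visible
            self.value = value
            self.version += 1      # even: update complete and visible

    def read(self):
        while True:
            v1 = self.version
            if v1 % 2:             # writer in progress, retry
                continue
            value = self.value
            if self.version == v1: # nothing changed while we read
                return value

rec = SeqlockRecord(10)
rec.write(42)
assert rec.read() == 42
```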
Citations: 122
PSLO: enforcing the Xth percentile latency and throughput SLOs for consolidated VM storage
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901330
Ning Li, Hong Jiang, D. Feng, Zhan Shi
It is desirable but challenging to simultaneously support a latency SLO at a pre-defined percentile, i.e., the Xth percentile latency SLO, and a throughput SLO for consolidated VM storage. Ensuring the Xth percentile latency contributes to accurately differentiating service levels in the metric of application-level latency SLO compliance, especially for applications built on multiple VMs. However, the Xth percentile latency SLO and throughput SLO enforcement are opposite sides of the same coin due to their conflicting requirements for the level of IO concurrency. To address this challenge, this paper proposes PSLO, a framework supporting the Xth percentile latency and throughput SLOs in a consolidated VM environment by precisely coordinating the level of IO concurrency and the arrival rate for each VM issue queue. Notably, PSLO can take full advantage of the available IO capacity allowed by SLO constraints to improve throughput or reduce latency with best effort. We design and implement a PSLO prototype in a real VM consolidation environment created by Xen. Our extensive trace-driven prototype evaluation shows that our system is able to optimize the Xth percentile latency and throughput for consolidated VMs under SLO constraints.
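A minimal sketch of the underlying control idea follows, with made-up gains, bounds, and sample data: measure the Xth percentile of a VM's recent IO latencies and shrink or grow its outstanding-IO window accordingly. This is not PSLO's actual coordination algorithm.

```python
# Feedback sketch: adjust a VM's outstanding-IO limit so that the Xth
# percentile of observed latencies stays under its SLO target.
import math

def percentile(samples, x):
    """Nearest-rank Xth percentile of a non-empty list of samples."""
    s = sorted(samples)
    rank = max(1, math.ceil(x / 100.0 * len(s)))
    return s[rank - 1]

def adjust_concurrency(current_limit, latencies_ms, x, slo_ms,
                       min_limit=1, max_limit=64):
    observed = percentile(latencies_ms, x)
    if observed > slo_ms:
        # latency SLO at risk: shrink the outstanding-IO window
        return max(min_limit, current_limit - 1), observed
    # latency has slack: spend it on throughput
    return min(max_limit, current_limit + 1), observed

window = [3.1, 2.8, 4.0, 9.5, 3.3, 2.9, 3.8, 12.0, 3.0, 3.2]  # recent IO latencies (ms)
limit, p95 = adjust_concurrency(16, window, x=95, slo_ms=8.0)
print(limit, p95)   # -> 15 12.0: the 95th percentile exceeds the 8 ms target
```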
Citations: 41
vScale: automatic and efficient processor scaling for SMP virtual machines
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901321
Luwei Cheng, J. Rao, F. Lau
SMP virtual machines (VMs) have been deployed extensively in clouds to host multithreaded applications. A widely known problem is that when CPUs are oversubscribed, the scheduling delays due to VM preemption give rise to many performance problems because of the impact of these delays on thread synchronization and I/O efficiency. Dynamically changing the number of virtual CPUs (vCPUs) by considering the available physical CPU (pCPU) cycles has been shown to be a promising approach. Unfortunately, there are currently no efficient mechanisms to support such vCPU-level elasticity. We present vScale, a cross-layer design to enable SMP-VMs to adaptively scale their vCPUs, at the cost of only microseconds. vScale consists of two extremely light-weight mechanisms: i) a generic algorithm in the hypervisor scheduler to compute VMs' CPU extendability, based on their proportional shares and CPU consumptions, and ii) an efficient method in the guest OS to quickly reconfigure the vCPUs. vScale can be tightly integrated with existing OS/hypervisor primitives and has very little management complexity. With our prototype in Xen/Linux, we evaluate vScale's performance with several representative multithreaded applications, including NPB suite, PARSEC suite and Apache web server. The results show that vScale can significantly reduce the VM's waiting time, and thus can accelerate many applications, especially synchronization-intensive ones and I/O-intensive ones.
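The flavor of the vCPU-sizing decision can be sketched as below; the inputs and rounding policy are assumptions for illustration, not vScale's exact hypervisor algorithm.

```python
# Sketch: pick a vCPU count close to the physical CPU time a VM is actually
# entitled to, so vCPUs are rarely preempted mid-critical-section.
import math

def target_vcpus(weight, all_weights, num_pcpus, recent_utilization, max_vcpus):
    # pCPU time the VM is entitled to under proportional sharing
    entitlement = num_pcpus * weight / sum(all_weights)
    # never expose more vCPUs than the VM can keep busy or is entitled to
    demand = recent_utilization * max_vcpus
    return max(1, min(max_vcpus, math.floor(min(entitlement, demand) + 0.5)))

# Two equal-weight VMs sharing 8 pCPUs, each configured with 8 vCPUs:
print(target_vcpus(256, [256, 256], 8, recent_utilization=0.9, max_vcpus=8))  # -> 4
```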
Citations: 32
GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901323
Henggang Cui, H. Zhang, G. Ganger, Phillip B. Gibbons, E. Xing
Large-scale deep learning requires huge computational resources to train a multi-layer neural network. Recent systems propose using 100s to 1000s of machines to train networks with tens of layers and billions of connections. While the computation involved can be done more efficiently on GPUs than on more traditional CPU cores, training such networks on a single GPU is too slow, and training on distributed GPUs can be inefficient due to data movement overheads, GPU stalls, and limited GPU memory. This paper describes a new parameter server, called GeePS, that supports scalable deep learning across GPUs distributed among multiple machines, overcoming these obstacles. We show that GeePS enables a state-of-the-art single-node GPU implementation to scale well, for example processing 13 times as many training images per second on 16 machines (relative to the original optimized single-node code). Moreover, GeePS achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
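For readers unfamiliar with the parameter-server pattern, here is a toy synchronous data-parallel loop in Python/NumPy. It shows only the read-parameters, compute-gradients, apply-updates cycle; GeePS's GPU-specialized caching, background data movement, and memory swapping are not modelled.

```python
# Minimal data-parallel parameter-server loop (NumPy stands in for GPU math).
import numpy as np

class ToyParamServer:
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr
    def read(self):
        return self.params.copy()
    def apply(self, grad):
        self.params -= self.lr * grad

def worker_gradient(params, x, y):
    # gradient of mean squared error for a linear model y ~ x @ w
    pred = x @ params
    return 2 * x.T @ (pred - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ true_w
ps = ToyParamServer(dim=4)
for step in range(200):
    # each of 4 "GPU workers" reads the parameters and computes a gradient
    grads = [worker_gradient(ps.read(), x[i::4], y[i::4]) for i in range(4)]
    ps.apply(np.mean(grads, axis=0))     # aggregate and update
print(np.round(ps.params, 2))            # approaches [ 1. -2.  0.5  3. ]
```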
Citations: 310
Hold 'em or fold 'em?: aggregation queries under performance variations
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901351
Gautam Kumar, G. Ananthanarayanan, S. Ratnasamy, I. Stoica
Systems are increasingly required to provide responses to queries, even if not exact, within stringent time deadlines. These systems parallelize computations over many processes and aggregate them hierarchically to get the final response (e.g., search engines and data analytics). Due to large performance variations in clusters, some processes are slower. Therefore, aggregators are faced with the question of how long to wait for outputs from processes before combining and sending them upstream. Longer waits increase response quality, since the response would include outputs from more processes. However, waiting longer also increases the risk of the aggregator failing to provide its result by the deadline, in which case all of its results are ignored, degrading response quality. Our algorithm, Cedar, offers a solution to this quandary of deciding wait durations at aggregators. It uses an online algorithm to learn distributions of durations at each level in the hierarchy and collectively optimizes the wait duration. Cedar's solution is theoretically sound, fully distributed, and generically applicable across systems that use aggregation trees, since it is agnostic to the causes of performance variations. Evaluation with production latency distributions from Google, Microsoft, and Facebook, using both deployment and simulation, shows that Cedar improves average response quality by over 100%.
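A simplified, single-level version of the wait-duration decision is sketched below: given an empirical distribution of child latencies and a deadline, pick the wait time that maximizes the expected fraction of outputs included. Cedar's actual objective and online learning are richer than this.

```python
# Pick the wait duration w that maximizes the expected fraction of child
# outputs included, while still forwarding the aggregate by `deadline`.
# Latencies, deadline and upstream cost below are made-up illustrations.
def best_wait(child_latency_samples, deadline, upstream_cost, grid_ms=1):
    samples = sorted(child_latency_samples)
    n = len(samples)
    best_w, best_quality = 0, 0.0
    w = 0
    while w + upstream_cost <= deadline:
        # fraction of children expected to have answered within w
        quality = sum(1 for s in samples if s <= w) / n
        if quality > best_quality:
            best_w, best_quality = w, quality
        w += grid_ms
    return best_w, best_quality

latencies = [12, 15, 18, 20, 22, 30, 45, 80, 120, 300]   # ms, long-tailed
print(best_wait(latencies, deadline=100, upstream_cost=20))  # (80, 0.8)
```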
Citations: 20
TFC: token flow control in data center networks
Pub Date : 2016-04-18 DOI: 10.1145/2901318.2901336
Jiao Zhang, Fengyuan Ren, Ran Shu, Peng Cheng
Services in modern data center networks pose growing performance demands. However, widely existing special traffic patterns, such as micro-bursts, highly concurrent flows, and on-off flow transmission, degrade the performance of transport protocols. In this work, a clean-slate explicit transport control mechanism, called Token Flow Control (TFC), is proposed for data center networks to achieve high link utilization, ultra-low latency, fast convergence, and rare packet drops. TFC uses tokens to represent the link bandwidth resource and defines the concept of effective flows to stand for consumers. The total tokens are explicitly allocated to each consumer every time slot. TFC excludes in-network buffer space from the flow pipeline and thus achieves zero queueing. Besides, a packet delay function is added at switches to prevent packet drops under highly concurrent flows. The performance of TFC is evaluated using both experiments on a small real testbed and large-scale simulations. The results show that TFC achieves high throughput, fast convergence, near-zero queuing, and rare packet loss in various scenarios.
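A toy version of the per-slot token allocation is sketched below; it assumes the set of effective flows is already known and omits the switch-side packet delay function.

```python
# Per-slot token allocation: a link's token budget for the slot is split
# evenly across its currently effective flows. Real TFC tracks effective
# flows at switches; this sketch takes them as given.

def allocate_tokens(link_tokens_per_slot, effective_flows):
    if not effective_flows:
        return {}
    share = link_tokens_per_slot // len(effective_flows)
    alloc = {f: share for f in effective_flows}
    # hand out any remainder one token at a time so nothing is wasted
    for f in list(effective_flows)[: link_tokens_per_slot % len(effective_flows)]:
        alloc[f] += 1
    return alloc

print(allocate_tokens(100, ["f1", "f2", "f3"]))  # {'f1': 34, 'f2': 33, 'f3': 33}
```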
Citations: 27