
Proceedings of the Eleventh European Conference on Computer Systems: Latest Publications

On the capacity of thermal covert channels in multicores
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901322
D. Bartolini, Philipp Miedl, L. Thiele
Modern multicore processors feature easily accessible temperature sensors that provide useful information for dynamic thermal management. These sensors were recently shown to be a potential security threat, since otherwise isolated applications can exploit them to establish a thermal covert channel and leak restricted information. Previous research presented experiments documenting the feasibility of (low-rate) communication over this channel, but did not further analyze its fundamental characteristics. For this reason, the important questions of quantifying the channel capacity and achievable rates remain unanswered. To address these questions, we devise and exploit a new methodology that leverages both theoretical results from information theory and experimental data to study these thermal covert channels on modern multicores. We use spectral techniques to analyze data from two representative platforms and estimate the capacity of the channels from a source application to temperature sensors on the same or different cores. We estimate the capacity to be on the order of 300 bits per second (bps) for the same-core channel, i.e., when reading the temperature on the same core where the source application runs, and on the order of 50 bps for the 1-hop channel, i.e., when reading the temperature of the core physically adjacent to the one where the source application runs. Moreover, we show a communication scheme that achieves rates of more than 45 bps on the same-core channel and more than 5 bps on the 1-hop channel, with less than 1% error probability. The highest rate shown in previous work was 1.33 bps on the 1-hop channel with 11% error probability.
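The same-core channel the authors measure can be pictured as on-off keying through a low-pass thermal plant: the source application toggles CPU load, heat diffuses through the die, and the receiver thresholds sensor readings per bit window. The sketch below is an illustrative simulation of that signal path (a first-order thermal model with made-up constants, not the paper's actual encoder):

```python
def transmit(bits, samples_per_bit=8):
    """Source app encodes bits as on/off CPU load (heat pulses)."""
    load = []
    for b in bits:
        load += [1.0 if b else 0.0] * samples_per_bit
    return load

def thermal_response(load, alpha=0.3):
    """First-order thermal model: core temperature low-pass filters the load."""
    temp, out = 0.0, []
    for u in load:
        temp += alpha * (u - temp)   # exponential heating / cooling
        out.append(temp)
    return out

def receive(temps, samples_per_bit=8, threshold=0.5):
    """Sensor-side decoder: average each bit window, then threshold."""
    bits = []
    for i in range(0, len(temps), samples_per_bit):
        window = temps[i:i + samples_per_bit]
        bits.append(1 if sum(window) / len(window) > threshold else 0)
    return bits
```

With these constants, eight samples per bit give the thermal response time to settle on either side of the threshold; shrinking the bit period is exactly what pushes the channel toward its noise-limited capacity.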
Citations: 62
Yoda: a highly available layer-7 load balancer
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901352
Rohan Gandhi, Charlie Hu, Ming Zhang
Layer-7 load balancing is a foundational building block of online services. The lack of offerings from major public cloud providers has left online services to build their own load balancers (LB) or to use third-party LB designs such as HAProxy. The key problem with such a proxy-based design is that each proxy instance is a single point of failure: upon failure, the TCP flow state for its connections with clients and servers is lost, which breaks user flows. This significantly affects user experience and online service revenue. In this paper, we present Yoda, a highly available, scalable and low-latency L7-LB-as-a-service in a public cloud. Yoda is based on two design principles we propose for achieving high availability of an L7 LB: decoupling the flow state from the LB instances and storing it in persistent storage, and leveraging the L4 LB service to enable each L7 LB instance to use the virtual IP in interacting with both the client and the server (called front-and-back indirection). Our evaluation of the Yoda prototype on a 60-VM testbed in Windows Azure shows the overhead of decoupling TCP state into persistent storage is very low (<1 msec), and Yoda maintains all flows during LB instance failures, additions, removals, and user policy updates. Our simulation, driven by a one-day trace from production online services, shows that compared to each tenant using Yoda individually, Yoda-as-a-service reduces L7 LB instance cost for the tenants by 3.7x while providing 4x more redundancy.
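The first design principle (flow state decoupled from LB instances into persistent storage) is what lets a failed instance's connections survive. A minimal sketch of that idea, with a plain dict standing in for the replicated store and hypothetical class names:

```python
class FlowStateStore:
    """Stand-in for Yoda's persistent flow-state store (the real system
    uses replicated storage; a dict is used here purely for illustration)."""
    def __init__(self):
        self._state = {}

    def put(self, flow_key, state):
        self._state[flow_key] = state

    def get(self, flow_key):
        return self._state.get(flow_key)

class L7Instance:
    """An L7 LB instance that keeps no hard per-flow state of its own."""
    def __init__(self, name, store):
        self.name, self.store = name, store

    def handle(self, flow_key, seq):
        # Fetch (or initialize) the flow's state from the shared store.
        st = self.store.get(flow_key) or {"next_seq": 0}
        assert seq == st["next_seq"], "state lost or packet out of order"
        st["next_seq"] = seq + 1
        self.store.put(flow_key, st)     # persist before acknowledging
        return st["next_seq"]
```

Instance B can pick up a flow exactly where instance A left off, because the only copy of per-flow state lives in the store rather than in A.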
Citations: 32
HAFT: hardware-assisted fault tolerance
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901339
Dmitrii Kuvaiskii, Rasha Faqeh, Pramod Bhatotia, P. Felber, C. Fetzer
Transient hardware faults during the execution of a program can cause data corruptions. We present HAFT, a fault tolerance technique using hardware extensions of commodity CPUs to protect unmodified multithreaded applications against such corruptions. HAFT utilizes instruction-level redundancy for fault detection and hardware transactional memory for fault recovery. We evaluated HAFT with Phoenix and PARSEC benchmarks. The observed normalized runtime is 2x, with 98.9% of the injected data corruptions being detected and 91.2% being corrected. To demonstrate the effectiveness of HAFT, we applied it to real-world case studies including Memcached, Apache, and SQLite.
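HAFT's division of labor (instruction-level redundancy detects, hardware transactional memory rolls back and retries) can be illustrated with a software analogue. The snippet below is a hedged sketch, not the actual compiler pass: duplication happens by re-invoking the function, the "transaction abort" is a loop retry, and the injected bit flip exists only to demonstrate detection.

```python
import random

def haft_style_execute(fn, args, max_retries=3, fault_rate=0.0, rng=None):
    """Toy detect-then-rollback loop in the spirit of HAFT: run the
    computation twice (ILR), compare, and retry the 'transaction' on
    mismatch (HTM rollback). fault_rate injects simulated bit flips."""
    rng = rng or random.Random(0)
    for _ in range(max_retries):
        r1 = fn(*args)
        r2 = fn(*args)                    # shadow (duplicated) execution
        if rng.random() < fault_rate:     # simulate a transient fault
            r2 = r2 ^ 1                   # single bit flip in the shadow copy
        if r1 == r2:                      # detection: the copies agree
            return r1                     # commit
        # mismatch: abort and re-execute (recovery)
    raise RuntimeError("uncorrected fault after retries")
```

In the real system the duplication is at the instruction level and recovery is a hardware transaction abort; here both are collapsed into re-running `fn`.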
Citations: 51
POSIX abstractions in modern operating systems: the old, the new, and the missing
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901350
Vaggelis Atlidakis, Jeremy Andrus, Roxana Geambasu, Dimitris Mitropoulos, Jason Nieh
The POSIX standard, developed 25 years ago, comprises a set of operating system (OS) abstractions that aid application portability across UNIX-based OSes. While OSes and applications have evolved tremendously over the last 25 years, POSIX, and the basic set of abstractions it provides, has remained largely unchanged. Little has been done to measure how and to what extent traditional POSIX abstractions are being used in modern OSes, and whether new abstractions are taking form, dethroning traditional ones. We explore these questions through a study of POSIX usage in modern desktop and mobile OSes: Android, OS X, and Ubuntu. Our results show that new abstractions are taking form, replacing several prominent traditional abstractions in POSIX. While the changes are driven by common needs and are conceptually similar across the three OSes, they are not converging on any new standard, increasing fragmentation.
Citations: 40
Juggler: a practical reordering resilient network stack for datacenters
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901334
Yilong Geng, V. Jeyakumar, A. Kabbani, Mohammad Alizadeh
We present Juggler, a practical reordering resilient network stack for datacenters that enables any packet to be sent on any path at any level of priority. Juggler adds functionality to the Generic Receive Offload layer at the entry of the network stack to put packets in order in a best-effort fashion. Juggler's design exploits the small packet delays in datacenter networks and the inherent burstiness of traffic to eliminate the negative effects of packet reordering almost entirely while keeping state for only a small number of flows at any given time. Extensive testbed experiments at 10Gb/s and 40Gb/s speeds show that Juggler is effective and lightweight: it prevents performance loss even with severe packet reordering while imposing low CPU overhead. We demonstrate the use of Juggler for per-packet multi-path load balancing and a novel system that provides bandwidth guarantees by dynamically prioritizing packets.
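The "best-effort fashion" is the interesting design point: hold out-of-order segments briefly, deliver in order when a gap fills, and give up (deliver anyway) once a hold budget expires, so the buffer never stalls a flow indefinitely. A toy per-flow resequencer in that spirit (hypothetical names and a tick-based budget, not Juggler's GRO implementation):

```python
class ReorderBuffer:
    """Best-effort per-flow resequencer: hold out-of-order segments
    briefly, flush in order, and give up after a small hold budget."""
    def __init__(self, hold_budget=3):
        self.expected = 0
        self.held = {}              # seq -> packet (here packet == seq)
        self.hold_budget = hold_budget
        self.ticks_held = 0

    def push(self, seq):
        out = []
        if seq == self.expected:
            out.append(seq)
            self.expected += 1
            while self.expected in self.held:   # drain any gap we just filled
                out.append(self.held.pop(self.expected))
                self.expected += 1
            self.ticks_held = 0
        else:
            self.held[seq] = seq
            self.ticks_held += 1
            if self.ticks_held > self.hold_budget:
                # best effort only: deliver out of order rather than stall
                for s in sorted(self.held):
                    out.append(self.held.pop(s))
                self.expected = max(out) + 1 if out else self.expected
                self.ticks_held = 0
        return out
```

The paper's insight is that datacenter delays and bursty arrivals make such a buffer cheap: gaps fill almost immediately, so very little state is held at any time.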
Citations: 40
Flash storage disaggregation
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901337
Ana Klimovic, C. Kozyrakis, Eno Thereska, Binu John, Sanjeev Kumar
PCIe-based Flash is commonly deployed to provide datacenter applications with high IO rates. However, its capacity and bandwidth are often underutilized, as it is difficult to design servers with the right balance of CPU, memory and Flash resources over time and for multiple applications. This work examines Flash disaggregation as a way to deal with Flash overprovisioning. We tune remote access to Flash over commodity networks and analyze its impact on workloads sampled from real datacenter applications. We show that, while remote Flash access introduces a 20% throughput drop at the application level, disaggregation allows us to make up for these overheads through resource-efficient scale-out. Hence, we show that Flash disaggregation allows scaling CPU and Flash resources independently in a cost-effective manner. We use our analysis to draw conclusions about data and control plane issues in remote storage.
Citations: 123
STRADS: a distributed framework for scheduled model parallel machine learning
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901331
Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, E. Xing
Machine learning (ML) algorithms are commonly applied to big data, using distributed systems that partition the data across machines and allow each machine to read and update all ML model parameters --- a strategy known as data parallelism. An alternative and complementary strategy, model parallelism, partitions the model parameters for non-shared parallel access and updates, and may periodically repartition the parameters to facilitate communication. Model parallelism is motivated by two challenges that data-parallelism does not usually address: (1) parameters may be dependent, thus naive concurrent updates can introduce errors that slow convergence or even cause algorithm failure; (2) model parameters converge at different rates, thus a small subset of parameters can bottleneck ML algorithm completion. We propose scheduled model parallelism (SchMP), a programming approach that improves ML algorithm convergence speed by efficiently scheduling parameter updates, taking into account parameter dependencies and uneven convergence. To support SchMP at scale, we develop a distributed framework STRADS which optimizes the throughput of SchMP programs, and benchmark four common ML applications written as SchMP programs: LDA topic modeling, matrix factorization, sparse least-squares (Lasso) regression and sparse logistic regression.
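The core of SchMP (dependent parameters update in different rounds, independent ones together) can be shown with a toy scheduler over an explicit conflict set. This is a greedy single-machine sketch with invented names; the real STRADS scheduler is distributed and adapts to convergence rates, neither of which is modeled here. Conflict sets are assumed symmetric.

```python
def schedule_rounds(params, conflicts):
    """Greedily pack parameters into update rounds so that no two
    conflicting (dependent) parameters are updated in the same round.
    conflicts[p] is the set of parameters that must not run with p."""
    rounds = []
    remaining = list(params)
    while remaining:
        this_round, blocked = [], set()
        for p in remaining:
            if p not in blocked:
                this_round.append(p)
                blocked |= conflicts.get(p, set())
        remaining = [p for p in remaining if p not in this_round]
        rounds.append(this_round)
    return rounds
```

Parameters within one round can then be updated in parallel without the convergence errors naive concurrent updates would introduce.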
Citations: 80
NChecker: saving mobile app developers from network disruptions
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901353
Xinxin Jin, Peng Huang, Tianyin Xu, Yuanyuan Zhou
Most of today's mobile apps rely on the underlying networks to deliver key functions such as web browsing, file synchronization, and social networking. Compared to desktop-based networks, mobile networks are much more dynamic, with frequent connectivity disruptions, network type switches, and quality changes, posing unique programming challenges for mobile app developers. As revealed in this paper, many mobile app developers fail to handle these intermittent network conditions in mobile network programming. Consequently, network programming defects (NPDs) are pervasive in mobile apps, causing bad user experiences such as crashes, data loss, etc. Despite the development of network libraries intended to lift this burden, we observe that many app developers fail to use these libraries properly and still introduce NPDs. In this paper, we study the characteristics of real-world NPDs in Android apps towards a deep understanding of their impacts, root causes, and code patterns. Driven by the study, we build NChecker, a practical tool to detect NPDs by statically analyzing Android app binaries. NChecker has been applied to hundreds of real Android apps and detected 4180 NPDs from 285 randomly selected apps with over 94% accuracy. Our further analysis of these defects reveals the common mistakes app developers make when working with existing network libraries' abstractions, which provides insights for improving the usability of mobile network libraries.
Citations: 12
Optimizing distributed actor systems for dynamic interactive services
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901343
Andrew Newell, G. Kliot, Ishai Menache, Aditya Gopalan, Soramichi Akiyama, M. Silberstein
Distributed actor systems are widely used for developing interactive scalable cloud services, such as social networks and on-line games. By modeling an application as a dynamic set of lightweight communicating "actors", developers can easily build complex distributed applications, while the underlying runtime system deals with low-level complexities of a distributed environment. We present ActOp---a data-driven, application-independent runtime mechanism for optimizing end-to-end service latency of actor-based distributed applications. ActOp targets the two dominant factors affecting latency: the overhead of remote inter-actor communications across servers, and the intra-server queuing delay. ActOp automatically identifies frequently communicating actors and migrates them to the same server transparently to the running application. The migration decisions are driven by a novel scalable distributed graph partitioning algorithm which does not rely on a single server to store the whole communication graph, thereby enabling efficient actor placement even for applications with rapidly changing graphs (e.g., chat services). Further, each server autonomously reduces the queuing delay by learning an internal queuing model and configuring threads according to instantaneous request rate and application demands. We prototype ActOp by integrating it with Orleans -- a popular open-source actor system [4, 13]. Experiments with realistic workloads show latency improvements of up to 75% for the 99th percentile, up to 63% for the mean, and up to a 2x increase in peak system throughput.
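A single-machine caricature of the placement step (co-locate chatty actor pairs, subject to per-server capacity) might look like the greedy pass below. ActOp's actual partitioner is distributed and never materializes the whole communication graph on one node, unlike this sketch, and all names here are invented:

```python
from collections import defaultdict

def place_actors(edges, servers, capacity):
    """Greedy communication-aware placement sketch: walk edges from
    heaviest to lightest so chatty actor pairs tend to land on the
    same server, spilling to the least-loaded server when full.
    edges: {(actor_a, actor_b): message_rate}"""
    placement = {}
    load = defaultdict(int)
    for (a, b), _weight in sorted(edges.items(), key=lambda kv: -kv[1]):
        for actor in (a, b):
            if actor not in placement:
                partner = b if actor == a else a
                # Prefer the partner's server if it has room.
                target = placement.get(partner)
                if target is None or load[target] >= capacity:
                    target = min(servers, key=lambda s: load[s])
                placement[actor] = target
                load[target] += 1
    return placement
```

Keeping the heaviest edges intra-server is what removes the remote inter-actor communication overhead the paper identifies as a dominant latency factor.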
Citations: 23
Type-aware transactions for faster concurrent code
Pub Date: 2016-04-18 DOI: 10.1145/2901318.2901348
Nathaniel Herman, J. Inala, Yihe Huang, Lillian Tsai, E. Kohler, B. Liskov, L. Shrira
It is often possible to improve a concurrent system's performance by leveraging the semantics of its datatypes. We build a new software transactional memory (STM) around this observation. A conventional STM tracks read- and write-sets of memory words; even simple operations can generate large sets. Our STM, which we call STO, tracks abstract operations on transactional datatypes instead. Parts of the transactional commit protocol are delegated to these datatypes' implementations, which can use datatype semantics, and new commit protocol features, to reduce bookkeeping, limit false conflicts, and implement efficient concurrency control. We test these ideas on the STAMP benchmark suite for STM applications and on our own prior work, the Silo high-performance in-memory database, observing large performance improvements in both systems.
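The abstract's key idea, tracking abstract operations instead of memory words, can be illustrated with a hypothetical transactional counter: increments commute, so increment-only transactions are buffered and never conflict, while a transaction that reads the value records an abstract read that must validate at commit. This is a minimal Python sketch of the concept under those assumptions, not STO's actual implementation or API.

```python
import threading
from collections import defaultdict

class TxCounter:
    """A shared counter with a per-object lock for commit time."""
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

class Txn:
    """Type-aware transaction: buffers commutative increments,
    validates abstract reads at commit."""
    def __init__(self):
        self.reads = []               # (counter, value observed)
        self.incs = defaultdict(int)  # counter -> pending delta

    def read(self, counter):
        # Observe the counter through this txn's own pending increments.
        self.reads.append((counter, counter.value))
        return counter.value + self.incs[counter]

    def increment(self, counter, delta=1):
        # Buffered abstract operation; commutes with concurrent increments,
        # so it adds nothing to the conflict footprint.
        self.incs[counter] += delta

    def commit(self):
        counters = sorted({c for c, _ in self.reads} | set(self.incs), key=id)
        for c in counters:            # lock in a canonical order (no deadlock)
            c.lock.acquire()
        try:
            ok = all(c.value == seen for c, seen in self.reads)
            if ok:
                for c, d in self.incs.items():
                    c.value += d      # apply buffered increments atomically
            return ok                 # False -> caller should retry the txn
        finally:
            for c in counters:
                c.lock.release()
```

Two transactions that only increment the same counter both commit, with no false conflict; a transaction that read the counter aborts if a concurrent increment committed in between, mirroring the read-validation a word-based STM would apply to every access.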
Citations: 45
Journal: Proceedings of the Eleventh European Conference on Computer Systems