ACM Transactions on Computer Systems: Latest Articles (Page 8)

Selective replication: A lightweight technique for soft errors

IF 1.5 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2009-12-01 DOI: 10.1145/1658357.1658359

X. Vera, J. Abella, J. Carretero, Antonio González

Soft errors are an important challenge in contemporary microprocessors. Modern processors have caches and large memory arrays protected by parity or error detection and correction codes. However, today's failure rate is dominated by flip-flops, latches, and the increasing sensitivity of combinational logic to particle strikes. Moreover, as Chip Multi-Processors (CMPs) become ubiquitous, meeting the FIT (failures in time) budget for new designs is becoming a major challenge. Solutions based on replicating threads have been explored deeply; however, their high cost in performance and energy makes them unsuitable for current designs. Moreover, our studies based on a typical configuration for a modern processor show that focusing on the top 5 most vulnerable structures can provide up to 70% reduction in FIT rate. Therefore, full replication may overprotect the chip by reducing the FIT much below budget. We propose Selective Replication, a lightweight, reconfigurable mechanism that achieves a high FIT reduction by protecting the most vulnerable instructions with minimal performance and energy impact. Low performance degradation is achieved by not requiring additional issue slots and reissuing instructions only during the time window between when they are retirable and when they actually retire. Coverage can be reconfigured online by replicating only a subset of the instructions (the most vulnerable ones). Instructions' vulnerability is estimated based on the area they occupy and the time they spend in the issue queue. By changing the vulnerability threshold, we can adjust the trade-off between coverage and performance loss. Results for an out-of-order processor configured similarly to Intel® Core™ Micro-Architecture show that our scheme can achieve over 65% FIT reduction with less than 4% performance degradation and small area and complexity overhead.
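The threshold-based selection described in this abstract can be sketched in a few lines. This is only an illustration, not the authors' hardware mechanism: the `Instruction` fields, the choice of area multiplied by issue-queue time as the score, and all numeric values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    name: str
    area_bits: int   # hypothetical: storage bits the instruction occupies
    iq_cycles: int   # hypothetical: cycles spent in the issue queue


def vulnerability(instr: Instruction) -> int:
    """Assumed score: occupied area multiplied by issue-queue residency."""
    return instr.area_bits * instr.iq_cycles


def select_for_replication(instrs, threshold):
    """Return the instructions whose score meets the threshold."""
    return [i for i in instrs if vulnerability(i) >= threshold]


window = [
    Instruction("load", area_bits=64, iq_cycles=12),
    Instruction("add", area_bits=48, iq_cycles=2),
    Instruction("mul", area_bits=48, iq_cycles=9),
]

# Lowering the threshold replicates more instructions (more coverage,
# more re-execution work); raising it does the opposite.
high_coverage = select_for_replication(window, threshold=100)
low_coverage = select_for_replication(window, threshold=500)
```

With these made-up numbers the lower threshold selects `load` and `mul`, while the higher one selects only `load`, which mirrors the coverage versus performance trade-off the abstract describes.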

Citations: 36

Sinfonia: A new paradigm for building scalable distributed systems

Pub Date : 2009-11-01 DOI: 10.1145/1629087.1629088

M. Aguilera, A. Merchant, Mehul A. Shah, Alistair C. Veitch, C. Karamanolis

We propose a new paradigm for building scalable distributed systems. Our approach does not require dealing with message-passing protocols, a major complication in existing distributed systems. Instead, developers just design and manipulate data structures within our service called Sinfonia. Sinfonia keeps data for applications on a set of memory nodes, each exporting a linear address space. At the core of Sinfonia is a new minitransaction primitive that enables efficient and consistent access to data, while hiding the complexities that arise from concurrency and failures. Using Sinfonia, we implemented two very different and complex applications in a few months: a cluster file system and a group communication service. Our implementations perform well and scale to hundreds of machines.
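The abstract does not spell out the minitransaction primitive. In the published Sinfonia design, a minitransaction bundles compare items, read items, and write items, and applies the reads and writes only if every compare matches. The sketch below models that rule on a single in-process node; the class name, method name, and tuple layouts are hypothetical, and distribution, locking, and the commit protocol are all left out.

```python
class MemoryNode:
    """Toy single-process stand-in for one memory node (linear byte space)."""

    def __init__(self, size: int):
        self.mem = bytearray(size)

    def exec_minitransaction(self, compares, reads, writes):
        """Apply writes and return reads only if every compare matches.

        compares: list of (addr, expected_bytes)
        reads:    list of (addr, length)
        writes:   list of (addr, new_bytes)
        """
        for addr, expected in compares:
            if bytes(self.mem[addr:addr + len(expected)]) != expected:
                return False, []          # abort: nothing is read or written
        results = [bytes(self.mem[addr:addr + n]) for addr, n in reads]
        for addr, data in writes:
            self.mem[addr:addr + len(data)] = data
        return True, results


node = MemoryNode(16)
# Compare-and-swap style update: set byte 0 to 1 only if it is still 0.
ok_first, _ = node.exec_minitransaction([(0, b"\x00")], [], [(0, b"\x01")])
ok_second, _ = node.exec_minitransaction([(0, b"\x00")], [], [(0, b"\x01")])
committed, values = node.exec_minitransaction([], [(0, 1)], [])
```

The second call aborts because the compare item no longer matches, which is how an application could detect a concurrent update without writing its own message-passing protocol.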

Citations: 114

Automated anomaly detection and performance modeling of enterprise applications

Pub Date : 2009-11-01 DOI: 10.1145/1629087.1629089

L. Cherkasova, K. Ozonat, N. Mi, J. Symons, E. Smirni

Automated tools for understanding application behavior and its changes during the application lifecycle are essential for many performance analysis and debugging tasks. Application performance issues have an immediate impact on customer experience and satisfaction. A sudden slowdown of an enterprise-wide application can affect a large population of customers, lead to delayed projects, and ultimately result in financial loss for the company. Significantly shortened time between new software releases further exacerbates the problem of thoroughly evaluating the performance of an updated application. Our thesis is that online performance modeling should be a part of routine application monitoring. Early, informative warnings on significant changes in application performance should help service providers identify and prevent performance problems, and their negative impact on the service, in a timely manner. We propose a novel framework for automated anomaly detection and application change analysis. It is based on integration of two complementary techniques: (i) a regression-based transaction model that reflects a resource consumption model of the application, and (ii) an application performance signature that provides a compact model of runtime behavior of the application. The proposed integrated framework provides a simple and powerful solution for anomaly detection and analysis of essential performance changes in application behavior. An additional benefit of the proposed approach is its simplicity: it is not intrusive and is based on monitoring data that is typically available in enterprise production environments. The introduced solution further enables the automation of capacity planning and resource provisioning tasks of multitier applications in rapidly evolving IT environments.
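As a rough illustration of technique (i), the sketch below fits a linear model of CPU time against per-window transaction counts and flags windows that the model explains poorly. It is a simplified stand-in, not the framework from the article: the restriction to two transaction types, the absence of an intercept, the 20% tolerance, and all data values are assumptions.

```python
def fit_two_type_model(windows):
    """Least-squares fit of cpu_ms ~= c_a * n_a + c_b * n_b (no intercept).

    windows: list of (n_a, n_b, cpu_ms) tuples, one per monitoring window.
    """
    saa = sum(a * a for a, _, _ in windows)
    sab = sum(a * b for a, b, _ in windows)
    sbb = sum(b * b for _, b, _ in windows)
    sau = sum(a * u for a, _, u in windows)
    sbu = sum(b * u for _, b, u in windows)
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("transaction mixes are collinear; cannot fit")
    # Solve the 2x2 normal equations with Cramer's rule.
    return (sau * sbb - sab * sbu) / det, (saa * sbu - sab * sau) / det


def is_anomalous(model, n_a, n_b, observed_ms, tolerance=0.2):
    """Flag a window whose CPU time deviates from the model's prediction."""
    predicted = model[0] * n_a + model[1] * n_b
    return abs(observed_ms - predicted) > tolerance * predicted


# Synthetic history generated with costs of 2 ms and 5 ms per transaction.
history = [(10, 2, 30.0), (4, 6, 38.0), (8, 8, 56.0)]
model = fit_two_type_model(history)

normal = is_anomalous(model, 5, 5, 36.0)    # close to the predicted 35 ms
suspect = is_anomalous(model, 5, 5, 70.0)   # twice the predicted cost
```

A window that the fitted costs cannot explain suggests either an anomaly or a genuine change in the application, which is the distinction the abstract's second technique is meant to help resolve.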

Citations: 89

Practical and low-overhead masking of failures of TCP-based servers

Pub Date : 2009-05-01 DOI: 10.1145/1534909.1534911

D. Zagorodnov, K. Marzullo, L. Alvisi, T. Bressoud

This article describes an architecture that allows a replicated service to survive crashes without breaking its TCP connections. Our approach does not require modifications to the TCP protocol, to the operating system on the server, or to any of the software running on the clients. Furthermore, it runs on commodity hardware. We compare two implementations of this architecture (one based on primary/backup replication and another based on message logging) focusing on scalability, failover time, and application transparency. We evaluate three types of services: a file server, a Web server, and a multimedia streaming server. Our experiments suggest that the approach incurs low overhead on throughput, scales well as the number of clients increases, and allows recovery of the service in near-optimal time.

Citations: 28

Distributed hash sketches: Scalable, efficient, and accurate cardinality estimation for distributed multisets

Pub Date : 2009-02-01 DOI: 10.1145/1482619.1482621

Nikos Ntarmos, P. Triantafillou, G. Weikum

Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name containing “spiderman”) where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are: (1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy integration in applications. This article contributes the DHS (Distributed Hash Sketches) method for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator. DHS is based on hash sketches for probabilistic counting, but distributes the bits of each counter across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying careful attention to fast access and aggregation as well as update costs. The article discusses various design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.
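The hash sketches that DHS builds on can be illustrated with a small single-process version in the style of Flajolet and Martin's probabilistic counting. This sketch does not distribute bits across DHT nodes as the article does; the class name, the use of SHA-1, and the bitmap sizes are assumptions chosen for the example.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


class HashSketch:
    def __init__(self, num_bitmaps: int = 64, bits: int = 32):
        self.m = num_bitmaps
        self.bits = bits
        self.bitmaps = [0] * num_bitmaps

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index, rest = h % self.m, h // self.m
        # Position of the least-significant 1 bit, capped to the bitmap width.
        rho = (rest & -rest).bit_length() - 1 if rest else self.bits - 1
        self.bitmaps[index] |= 1 << min(rho, self.bits - 1)

    def merge(self, other: "HashSketch") -> "HashSketch":
        """Sketch of the union: bitwise OR, so overlaps are not double counted."""
        merged = HashSketch(self.m, self.bits)
        merged.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return merged

    def estimate(self) -> float:
        def lowest_zero(bitmap: int) -> int:
            r = 0
            while bitmap & (1 << r):
                r += 1
            return r
        mean_r = sum(lowest_zero(b) for b in self.bitmaps) / self.m
        return self.m / PHI * 2 ** mean_r


# Two nodes hold overlapping subsets of 1000 distinct file names.
node_a, node_b, whole = HashSketch(), HashSketch(), HashSketch()
for i in range(600):
    node_a.add(f"file-{i}")
for i in range(400, 1000):
    node_b.add(f"file-{i}")
for i in range(1000):
    whole.add(f"file-{i}")
merged = node_a.merge(node_b)
```

Because adding an item only sets a bit, duplicates and overlapping subsets leave the sketch unchanged, so the merged sketch is identical to one built from the full set. That property is what makes the structure suitable for counting distinct items across nodes.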

Citations: 24

Hill-climbing SMT processor resource distribution

Pub Date : 2009-02-01 DOI: 10.1145/1482619.1482620

Seungryul Choi, D. Yeung

The key to high performance in Simultaneous MultiThreaded (SMT) processors lies in optimizing the distribution of shared resources to active threads. Existing resource distribution techniques optimize performance only indirectly. They infer potential performance bottlenecks by observing indicators, like instruction occupancy or cache miss counts, and take actions to try to alleviate them. While the corrective actions are designed to improve performance, their actual performance impact is not known since end performance is never monitored. Consequently, potential performance gains are lost whenever the corrective actions do not effectively address the actual bottlenecks occurring in the pipeline. We propose a different approach to SMT resource distribution that optimizes end performance directly. Our approach observes the impact that resource distribution decisions have on performance at runtime, and feeds this information back to the resource distribution mechanisms to improve future decisions. By evaluating many different resource distributions, our approach tries to learn the best distribution over time. Because we perform learning online, learning time is crucial. We develop a hill-climbing algorithm that quickly learns the best distribution of resources by following the performance gradient within the resource distribution space. We also develop several ideal learning algorithms to enable deeper insights through limit studies. This article conducts an in-depth investigation of hill-climbing SMT resource distribution using a comprehensive suite of 63 multiprogrammed workloads. Our results show hill-climbing outperforms ICOUNT, FLUSH, and DCRA (three existing SMT techniques) by 11.4%, 11.5%, and 2.8%, respectively, under the weighted IPC metric. 
A limit study conducted using our ideal learning algorithms shows our approach can potentially outperform the same techniques by 19.2%, 18.0%, and 7.6%, respectively, thus demonstrating additional room exists for further improvement. Using our ideal algorithms, we also identify three bottlenecks that limit online learning speed: local maxima, phased behavior, and interepoch jitter. We define metrics to quantify these learning bottlenecks, and characterize the extent to which they occur in our workloads. Finally, we conduct a sensitivity study, and investigate several extensions to improve our hill-climbing technique.
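The idea of following the performance gradient through the space of resource partitions can be sketched as a plain greedy search. This is not the authors' hardware algorithm: the `perf` callback stands in for one measured epoch, and the synthetic score function, step size, and starting point are assumptions for the example.

```python
def hill_climb(perf, start, delta=1, max_rounds=100):
    """Greedy search over partitions of a fixed resource budget.

    perf:  callable mapping a tuple of per-thread shares to a score
           (stands in for one measured epoch; hypothetical here).
    start: initial tuple of per-thread shares.
    """
    current, best = tuple(start), perf(tuple(start))
    for _ in range(max_rounds):
        improved = False
        for src in range(len(current)):
            for dst in range(len(current)):
                if src == dst or current[src] < delta:
                    continue
                # Try shifting `delta` units from thread src to thread dst.
                trial = list(current)
                trial[src] -= delta
                trial[dst] += delta
                score = perf(tuple(trial))
                if score > best:
                    current, best, improved = tuple(trial), score, True
        if not improved:
            break  # local maximum: no neighbouring partition is better
    return current, best


def synthetic_perf(shares):
    """Made-up concave score that peaks when the shares are (20, 12)."""
    return -((shares[0] - 20) ** 2 + (shares[1] - 12) ** 2)


shares, score = hill_climb(synthetic_perf, start=(16, 16))
```

On a score surface with several peaks the same loop would stop at the nearest one, which corresponds to the local-maxima bottleneck the abstract lists.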

Citations: 35

Improving peer-to-peer performance through server-side scheduling

Pub Date : 2008-12-01 DOI: 10.1145/1455258.1455260

Y. Qiao, F. Bustamante, P. Dinda, S. Birrer, Dong Lu

We show how to significantly improve the mean response time seen by both uploaders and downloaders in peer-to-peer data-sharing systems. Our work is motivated by the observation that response times are largely determined by the performance of the peers serving the requested objects, that is, by the peers in their capacity as servers. With this in mind, we take a close look at this server side of peers, characterizing its workload by collecting and examining an extensive set of traces. Using trace-driven simulation, we demonstrate the promise and potential problems with scheduling policies based on shortest-remaining-processing-time (SRPT), the algorithm known to be optimal for minimizing mean response time. The key challenge to using SRPT in this context is determining request service times. In addressing this challenge, we introduce two new estimators that enable predictive SRPT scheduling policies that closely approach the performance of ideal SRPT. We evaluate our approach through extensive single-server and system-level simulation coupled with real Internet deployment and experimentation.
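To see why SRPT lowers mean response time, the sketch below simulates one server under preemptive SRPT and under FIFO. It assumes service times are known exactly, which is the very thing the article's estimators have to predict; the function name and job values are made up for the example.

```python
import heapq


def mean_response_time(jobs, use_srpt=True):
    """Simulate one server. jobs: list of (arrival_time, service_time)."""
    pending = sorted(jobs)               # by arrival time
    ready, now, total, i = [], 0.0, 0.0, 0
    while i < len(pending) or ready:
        if not ready:                    # idle: jump to the next arrival
            now = max(now, pending[i][0])
        while i < len(pending) and pending[i][0] <= now:
            arrival, service = pending[i]
            # SRPT orders by remaining time; FIFO orders by arrival time.
            key = service if use_srpt else arrival
            heapq.heappush(ready, (key, arrival, service))
            i += 1
        key, arrival, remaining = heapq.heappop(ready)
        next_arrival = pending[i][0] if i < len(pending) else float("inf")
        run = min(remaining, next_arrival - now)
        now += run
        remaining -= run
        if remaining > 0:                # re-queue when a new job arrives
            key = remaining if use_srpt else arrival
            heapq.heappush(ready, (key, arrival, remaining))
        else:
            total += now - arrival
    return total / len(jobs)


# One long request followed shortly by two short ones.
jobs = [(0, 10), (1, 2), (2, 1)]
srpt = mean_response_time(jobs, use_srpt=True)
fifo = mean_response_time(jobs, use_srpt=False)
```

In this toy trace the short requests no longer wait behind the long one, so the SRPT mean is roughly half the FIFO mean, at the cost of delaying the long request.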

Citations: 8

Adaptive work-stealing with parallelism feedback

Pub Date : 2008-09-01 DOI: 10.1145/1394441.1394443

Kunal Agrawal, C. Leiserson, Yuxiong He, W. Hsu

Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on its allotted processors. We present a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. Assuming that the job scheduler never allots a job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. We model the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the operating environment as well as to the job scheduler's administrative policies. For example, the job scheduler might make a large number of processors available exactly when the job has little use for them. To analyze the performance of our adaptive thread scheduler under this stringent adversarial assumption, we introduce a new technique called trim analysis, which allows us to prove that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting near-optimal behavior on the vast majority. More precisely, suppose that a job has work $T_1$ and span $T_\infty$. On a machine with $P$ processors, A-STEAL completes the job in an expected duration of $O(T_1/\tilde{P} + T_\infty + L \lg P)$ time steps, where $L$ is the length of a scheduling quantum, and $\tilde{P}$ denotes the $O(T_\infty + L \lg P)$-trimmed availability. This quantity is the average of the processor availability over all time steps except the $O(T_\infty + L \lg P)$ time steps that have the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, $\tilde{P} \ll T_1/T_\infty$, the job achieves nearly perfect linear speedup. Conversely, when the trimmed mean dominates the parallelism, the asymptotic running time of the job is nearly the length of its span, which is optimal. We measured the performance of A-STEAL on a simulated multiprocessor system using synthetic workloads. For jobs with sufficient parallelism, our experiments confirm that A-STEAL provides almost perfect linear speedup across a variety of processor availability profiles. We compared A-STEAL with the ABP algorithm, an adaptive work-stealing thread scheduler developed by Arora et al. [1998] which does not employ parallelism feedback. On moderately to heavily loaded machines with large numbers of processors, A-STEAL typically completed jobs more than twice as quickly as ABP, despite being allotted the same number or fewer processors on every step, while wasting only 10% of the processor cycles wasted by ABP.
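The trimmed availability used in the bound is simple to compute once the number of trimmed steps is fixed. The sketch below takes that number as a plain parameter `r`, whereas the abstract gives it only asymptotically; the function name and the availability profile are made up for the example.

```python
def trimmed_availability(availability, r):
    """Mean processor availability after discarding the r highest time steps."""
    if r < 0 or r >= len(availability):
        raise ValueError("r must be between 0 and the number of steps minus 1")
    kept = sorted(availability)[: len(availability) - r]
    return sum(kept) / len(kept)


# An adversarial profile: 64 processors are offered on just two steps.
profile = [4, 4, 64, 4, 64, 4]
plain_mean = sum(profile) / len(profile)
trimmed = trimmed_availability(profile, r=2)
```

Here the plain mean is 24 processors, while the 2-trimmed availability is 4. The trimmed value ignores the two bursts, which is how the analysis keeps a few adversarially generous time steps from distorting the running-time bound.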

Citations: 20

A stateless approach to connection-oriented protocols

IF 1.5 Zone 4 Computer Science Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2008-09-01 DOI: 10.1145/1394441.1394444

Alan Shieh, A. Myers, E. G. Sirer

Traditional operating system interfaces and network protocol implementations force some system state to be kept on both sides of a connection. This state ties the connection to its endpoints, impedes transparent failover, permits denial-of-service attacks, and limits scalability. This article introduces a novel TCP-like transport protocol and a new interface to replace sockets that together enable all state to be kept on one endpoint, allowing the other endpoint, typically the server, to operate without any per-connection state. Called Trickles, this approach enables servers to scale well with increasing numbers of clients, consume fewer resources, and better resist denial-of-service attacks. Measurements on a full implementation in Linux indicate that Trickles achieves performance comparable to TCP/IP, interacts well with other flows, and scales well. Trickles also enables qualitatively different kinds of networked services. Services can be geographically replicated and contacted through an anycast primitive for improved availability and performance. Widely-deployed practices that currently have client-observable side effects, such as periodic server reboots, connection redirection, and failover, can be made transparent, and perform well, under Trickles. The protocol is secure against tampering and replay attacks, and the client interface is backward-compatible, requiring no changes to sockets-based client applications.
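
The idea of keeping all per-connection state on the client can be sketched as follows. This is an illustrative toy, not the Trickles wire format or API: the use of HMAC-SHA256, the JSON encoding, the `offset` field, and every name below are assumptions. The server hands its state to the client as an integrity-protected blob and rebuilds it from the blob attached to the next request, so it holds nothing between requests. Replay protection, which the abstract also claims, is not modeled here.

```python
import hashlib
import hmac
import json

SERVER_KEY = b"hypothetical-server-secret"  # assumed key, known only to the server


def seal(state: dict) -> bytes:
    """Serialize connection state and prepend a MAC so the client cannot alter it."""
    payload = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest().encode()
    return tag + b"." + payload


def unseal(blob: bytes) -> dict:
    """Verify the MAC and recover the state; reject tampered blobs."""
    tag, _, payload = blob.partition(b".")
    expected = hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("continuation failed integrity check")
    return json.loads(payload)


def handle_request(blob: bytes, nbytes: int) -> bytes:
    """Stateless handler: all state arrives with the request and leaves with the reply."""
    state = unseal(blob)
    state["offset"] += nbytes
    return seal(state)


c0 = seal({"offset": 0})       # initial state issued to the client
c1 = handle_request(c0, 100)   # any server replica holding the key could serve this
print(unseal(c1)["offset"])    # 100
```

Because the handler depends only on the key and the incoming blob, a reboot or a failover to another replica between the two requests would not be visible to the client in this toy model.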

Citations: 15

RaWMS - Random Walk Based Lightweight Membership Service for Wireless Ad Hoc Networks

IF 1.5 Zone 4 Computer Science Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2008-06-01 DOI: 10.1145/1365815.1365817

Ziv Bar-Yossef, R. Friedman, G. Kliot

This article presents RaWMS, a novel lightweight random membership service for ad hoc networks. The service provides each node with a partial uniformly chosen view of network nodes. Such a membership service is useful, for example, in data dissemination algorithms, lookup and discovery services, peer sampling services, and complete membership construction. The design of RaWMS is based on a novel reverse random walk (RW) sampling technique. The article includes a formal analysis of both the reverse RW sampling technique and RaWMS and verifies it through a detailed simulation study. In addition, RaWMS is compared both analytically and by simulations with a number of other known methods such as flooding and gossip-based techniques.
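
The abstract does not spell out how reverse random walk sampling works, so the following is only a sketch of one plausible reading: each node launches walks carrying its own ID, and the node where a walk stops records that ID in its view. The maximum-degree walk (which has a uniform stationary distribution on a connected graph), the fixed walk length, the adjacency-list graph, and all names are assumptions rather than details taken from the article.

```python
import random


def reverse_rw_views(graph, walk_length, walks_per_node, seed=0):
    """Build partial membership views by simulating walks that carry node IDs.

    graph: dict mapping node -> list of neighbor nodes (assumed undirected).
    Returns a dict mapping node -> set of IDs deposited at that node.
    """
    rng = random.Random(seed)
    max_deg = max(len(nbrs) for nbrs in graph.values())
    views = {node: set() for node in graph}
    for origin in graph:
        for _ in range(walks_per_node):
            current = origin
            for _ in range(walk_length):
                # Maximum-degree walk: each neighbor is chosen with probability
                # 1/max_deg; otherwise the walk stays put for this step.
                i = rng.randrange(max_deg)
                nbrs = graph[current]
                if i < len(nbrs):
                    current = nbrs[i]
            views[current].add(origin)  # the endpoint learns about the origin
    return views


# A small path graph 0 - 1 - 2 - 3 with uneven degrees.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
views = reverse_rw_views(path, walk_length=20, walks_per_node=3)
for node, view in views.items():
    print(node, sorted(view))
```

In this reading the sampling is "reverse" because a node does not walk out to collect samples; instead, samples arrive at it from walks started elsewhere.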

Citations: 82