Experience distributing objects in an SMMP OS
J. Appavoo, D. D. Silva, O. Krieger, M. Auslander, Michal Ostrowski, Bryan S. Rosenburg, Amos Waterland, R. Wisniewski, J. Xenidis, M. Stumm, Livio Baldini Soares
Designing and implementing system software so that it scales well on shared-memory multiprocessors (SMMPs) has proven to be surprisingly challenging. To improve scalability, most designers to date have focused on concurrency by iteratively eliminating the need for locks and reducing lock contention. However, our experience indicates that locality is just as, if not more, important and that focusing on locality ultimately leads to a more scalable system. In this paper, we describe a methodology and a framework for constructing system software structured for locality, exploiting techniques similar to those used in distributed systems. Specifically, we found two techniques to be effective in improving scalability of SMMP operating systems: (i) an object-oriented structure that minimizes sharing by providing a natural mapping from independent requests to independent code paths and data structures, and (ii) the selective partitioning, distribution, and replication of object implementations in order to improve locality. We describe concrete examples of distributed objects and our experience implementing them. We demonstrate that the distributed implementations improve the scalability of operating-system-intensive parallel workloads.
{"title":"Experience distributing objects in an SMMP OS","authors":"J. Appavoo, D. D. Silva, O. Krieger, M. Auslander, Michal Ostrowski, Bryan S. Rosenburg, Amos Waterland, R. Wisniewski, J. Xenidis, M. Stumm, Livio Baldini Soares","doi":"10.1145/1275517.1275518","DOIUrl":"https://doi.org/10.1145/1275517.1275518","url":null,"abstract":"Designing and implementing system software so that it scales well on shared-memory multiprocessors (SMMPs) has proven to be surprisingly challenging. To improve scalability, most designers to date have focused on concurrency by iteratively eliminating the need for locks and reducing lock contention. However, our experience indicates that locality is just as, if not more, important and that focusing on locality ultimately leads to a more scalable system.\u0000 In this paper, we describe a methodology and a framework for constructing system software structured for locality, exploiting techniques similar to those used in distributed systems. Specifically, we found two techniques to be effective in improving scalability of SMMP operating systems: (i) an object-oriented structure that minimizes sharing by providing a natural mapping from independent requests to independent code paths and data structures, and (ii) the selective partitioning, distribution, and replication of object implementations in order to improve locality. We describe concrete examples of distributed objects and our experience implementing them. We demonstrate that the distributed implementations improve the scalability of operating-system-intensive parallel workloads.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"24 1","pages":"6"},"PeriodicalIF":1.5,"publicationDate":"2007-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77917275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The WaveScalar architecture
S. Swanson, Andrew Schwerin, M. Kim, Andrew Petersen, Andrew Putnam, Ken Michelson, M. Oskin, S. Eggers
Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity/high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application, or even a single function. To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new ones in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication. This article presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications the WaveCache achieves nearly linear speedup with up to 64 threads and can sustain 7–14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to equake from SPEC2000 and speed it up by 9× compared to the serial version.
{"title":"The WaveScalar architecture","authors":"S. Swanson, Andrew Schwerin, M. Kim, Andrew Petersen, Andrew Putnam, Ken Michelson, M. Oskin, S. Eggers","doi":"10.1145/1233307.1233308","DOIUrl":"https://doi.org/10.1145/1233307.1233308","url":null,"abstract":"Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity/high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application, or even a single function.\u0000 To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new ones in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication.\u0000 This article presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications the WaveCache achieves nearly linear speedup with up to 64 threads and can sustain 7--14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to equake from Spec2000 and speed it up by 9x compared to the serial version.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"7 1","pages":"4:1-4:54"},"PeriodicalIF":1.5,"publicationDate":"2007-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79682218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Concurrent programming without locks
K. Fraser, T. Harris
Mutual exclusion locks remain the de facto mechanism for concurrency control on shared-memory data structures. However, their apparent simplicity is deceptive: It is hard to design scalable locking strategies because locks can harbor problems such as priority inversion, deadlock, and convoying. Furthermore, scalable lock-based systems are not readily composable when building compound operations. In looking for solutions to these problems, interest has developed in nonblocking systems which have promised scalability and robustness by eschewing mutual exclusion while still ensuring safety. However, existing techniques for building nonblocking systems are rarely suitable for practical use, imposing substantial storage overheads, serializing nonconflicting operations, or requiring instructions not readily available on today's CPUs. In this article we present three APIs which make it easier to develop nonblocking implementations of arbitrary data structures. The first API is a multiword compare-and-swap operation (MCAS) which atomically updates a set of memory locations. This can be used to advance a data structure from one consistent state to another. The second API is a word-based software transactional memory (WSTM) which can allow sequential code to be reused more directly than with MCAS and which provides better scalability when locations are being read rather than being updated. The third API is an object-based software transactional memory (OSTM). OSTM allows a simpler implementation than WSTM, but at the cost of reengineering the code to use OSTM objects. We present practical implementations of all three of these APIs, built from operations available across all of today's major CPU families. We illustrate the use of these APIs by using them to build highly concurrent skip lists and red-black trees. We compare the performance of the resulting implementations against one another and against high-performance lock-based systems. These results demonstrate that it is possible to build useful nonblocking data structures with performance comparable to, or better than, sophisticated lock-based designs.
{"title":"Concurrent programming without locks","authors":"K. Fraser, T. Harris","doi":"10.1145/1233307.1233309","DOIUrl":"https://doi.org/10.1145/1233307.1233309","url":null,"abstract":"Mutual exclusion locks remain the de facto mechanism for concurrency control on shared-memory data structures. However, their apparent simplicity is deceptive: It is hard to design scalable locking strategies because locks can harbor problems such as priority inversion, deadlock, and convoying. Furthermore, scalable lock-based systems are not readily composable when building compound operations. In looking for solutions to these problems, interest has developed in nonblocking systems which have promised scalability and robustness by eschewing mutual exclusion while still ensuring safety. However, existing techniques for building nonblocking systems are rarely suitable for practical use, imposing substantial storage overheads, serializing nonconflicting operations, or requiring instructions not readily available on today's CPUs.\u0000 In this article we present three APIs which make it easier to develop nonblocking implementations of arbitrary data structures. The first API is a multiword compare-and-swap operation (MCAS) which atomically updates a set of memory locations. This can be used to advance a data structure from one consistent state to another. The second API is a word-based software transactional memory (WSTM) which can allow sequential code to be reused more directly than with MCAS and which provides better scalability when locations are being read rather than being updated. The third API is an object-based software transactional memory (OSTM). OSTM allows a simpler implementation than WSTM, but at the cost of reengineering the code to use OSTM objects.\u0000 We present practical implementations of all three of these APIs, built from operations available across all of today's major CPU families. We illustrate the use of these APIs by using them to build highly concurrent skip lists and red-black trees. We compare the performance of the resulting implementations against one another and against high-performance lock-based systems. These results demonstrate that it is possible to build useful nonblocking data structures with performance comparable to, or better than, sophisticated lock-based designs.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"17 1","pages":"5"},"PeriodicalIF":1.5,"publicationDate":"2007-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83791538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Specifying memory consistency of write buffer multiprocessors
L. Higham, L. Jackson, J. Kawash
Write buffering is one of many successful mechanisms that improve the performance and scalability of multiprocessors. However, it leads to more complex memory system behavior, which cannot be described using intuitive consistency models, such as Sequential Consistency. It is crucial to provide programmers with a specification of the exact behavior of such complex memories. This article presents a uniform framework for describing systems at different levels of abstraction and proving their equivalence. The framework is used to derive, and prove correct, simple program-level specifications of the SPARC Total Store Order and Partial Store Order memories. The framework is also used to examine the SPARC Relaxed Memory Order. We show that it is not a memory consistency model that corresponds to any implementation on a multiprocessor that uses write buffers, even though we suspect that the SPARC Version 9 specification of Relaxed Memory Order was intended to capture a general write-buffer architecture. The same technique is used to show that Coherence does not correspond to a write-buffer architecture. A corollary, which follows from the relationship between Coherence and Alpha, is that any implementation of Alpha consistency using write buffers cannot produce all possible Alpha computations. That is, there are some computations that satisfy the Alpha specification but cannot occur in the given write-buffer implementation.
{"title":"Specifying memory consistency of write buffer multiprocessors","authors":"L. Higham, L. Jackson, J. Kawash","doi":"10.1145/1189736.1189737","DOIUrl":"https://doi.org/10.1145/1189736.1189737","url":null,"abstract":"Write buffering is one of many successful mechanisms that improves the performance and scalability of multiprocessors. However, it leads to more complex memory system behavior, which cannot be described using intuitive consistency models, such as Sequential Consistency. It is crucial to provide programmers with a specification of the exact behavior of such complex memories. This article presents a uniform framework for describing systems at different levels of abstraction and proving their equivalence. The framework is used to derive and prove correct simple specifications in terms of program-level instructions of the sparc total store order and partial store order memories.The framework is also used to examine the sparc relaxed memory order. We show that it is not a memory consistency model that corresponds to any implementation on a multiprocessor that uses write-buffers, even though we suspect that the sparc version 9 specification of relaxed memory order was intended to capture a general write-buffer architecture. The same technique is used to show that Coherence does not correspond to a write-buffer architecture. A corollary, which follows from the relationship between Coherence and Alpha, is that any implementation of Alpha consistency using write-buffers cannot produce all possible Alpha computations. That is, there are some computations that satisfy the Alpha specification but cannot occur in the given write-buffer implementation.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"8 1","pages":"1"},"PeriodicalIF":1.5,"publicationDate":"2007-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73706350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comprehensive multivariate extrapolation modeling of multiprocessor cache miss rates
Ilya Gluhovsky, D. Vengerov, B. O'Krafka
Cache miss rates are an important subset of system model inputs. Cache miss rate models are used for broad design space exploration in which many cache configurations cannot be simulated directly due to limitations of trace collection setups or available resources. Often it is not practical to simulate large caches, and large processor counts, with the consequently high degrees of cache sharing, are frequently not reproducible on small existing systems. In this article, we present an approach to building multivariate regression models for predicting cache miss rates beyond the range of collectible data. The extrapolation model attempts to accurately estimate the high-level trend of the existing data, which can be extended in a natural way. We extend previous work through applicability to multiple miss rate components and the ability to model a wide range of cache parameters, including size, line size, associativity, and sharing. The stability of extrapolation is recognized to be a crucial requirement; the proposed extrapolation model is shown to be stable to small data perturbations that may be introduced during data collection. We show the effectiveness of the technique by applying it to two commercial workloads. The wide design space contains configurations that are much larger than those for which miss rate data were available. The fitted data match the simulation data very well. The various curves show how a miss rate model is useful not only for estimating the performance of specific configurations, but also for providing insight into miss rate trends.
{"title":"Comprehensive multivariate extrapolation modeling of multiprocessor cache miss rates","authors":"Ilya Gluhovsky, D. Vengerov, B. O'Krafka","doi":"10.1145/1189736.1189738","DOIUrl":"https://doi.org/10.1145/1189736.1189738","url":null,"abstract":"Cache miss rates are an important subset of system model inputs. Cache miss rate models are used for broad design space exploration in which many cache configurations cannot be simulated directly due to limitations of trace collection setups or available resources. Often it is not practical to simulate large caches. Large processor counts and consequent potentially high degree of cache sharing are frequently not reproducible on small existing systems. In this article, we present an approach to building multivariate regression models for predicting cache miss rates beyond the range of collectible data. The extrapolation model attempts to accurately estimate the high-level trend of the existing data, which can be extended in a natural way. We extend previous work by its applicability to multiple miss rate components and its ability to model a wide range of cache parameters, including size, line size, associativity and sharing. The stability of extrapolation is recognized to be a crucial requirement. The proposed extrapolation model is shown to be stable to small data perturbations that may be introduced during data collection.We show the effectiveness of the technique by applying it to two commercial workloads. The wide design space contains configurations that are much larger than those for which miss rate data were available. The fitted data match the simulation data very well. The various curves show how a miss rate model is useful for not only estimating the performance of specific configurations, but also for providing insight into miss rate trends.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"54 8 1","pages":"2"},"PeriodicalIF":1.5,"publicationDate":"2007-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78321430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rethink the sync
Edmund B. Nightingale, K. Veeraraghavan, Peter M. Chen, J. Flinn
We introduce external synchrony, a new model for local file I/O that provides the reliability and simplicity of synchronous I/O, yet also closely approximates the performance of asynchronous I/O. An external observer cannot distinguish the output of a computer with an externally synchronous file system from the output of a computer with a synchronous file system. No application modification is required to use an externally synchronous file system: in fact, application developers can program to the simpler synchronous I/O abstraction and still receive excellent performance. We have implemented an externally synchronous file system for Linux, called xsyncfs. Xsyncfs provides the same durability and ordering guarantees as those provided by a synchronously mounted ext3 file system. Yet, even for I/O-intensive benchmarks, xsyncfs performance is within 7% of ext3 mounted asynchronously. Compared to ext3 mounted synchronously, xsyncfs is up to two orders of magnitude faster.
{"title":"Rethink the sync","authors":"Edmund B. Nightingale, K. Veeraraghavan, Peter M. Chen, J. Flinn","doi":"10.1145/1394441.1394442","DOIUrl":"https://doi.org/10.1145/1394441.1394442","url":null,"abstract":"We introduce external synchrony, a new model for local file I/O that provides the reliability and simplicity of synchronous I/O, yet also closely approximates the performance of asynchronous I/O. An external observer cannot distinguish the output of a computer with an externally synchronous file system from the output of a computer with a synchronous file system. No application modification is required to use an externally synchronous file system: in fact, application developers can program to the simpler synchronous I/O abstraction and still receive excellent performance. We have implemented an externally synchronous file system for Linux, called xsyncfs. Xsyncfs provides the same durability and ordering guarantees as those provided by a synchronously mounted ext3 file system. Yet, even for I/O-intensive benchmarks, xsyncfs performance is within 7% of ext3 mounted asynchronously. Compared to ext3 mounted synchronously, xsyncfs is up to two orders of magnitude faster.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"1 1","pages":"6:1-6:26"},"PeriodicalIF":1.5,"publicationDate":"2006-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85000705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bigtable: A Distributed Storage System for Structured Data
Fay W. Chang, J. Dean, S. Ghemawat, Wilson C. Hsieh, D. Wallach, M. Burrows, Tushar Chandra, Andrew Fikes, R. Gruber
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
{"title":"Bigtable: A Distributed Storage System for Structured Data","authors":"Fay W. Chang, J. Dean, S. Ghemawat, Wilson C. Hsieh, D. Wallach, M. Burrows, Tushar Chandra, Andrew Fikes, R. Gruber","doi":"10.1145/1365815.1365816","DOIUrl":"https://doi.org/10.1145/1365815.1365816","url":null,"abstract":"Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"29 1","pages":"4:1-4:26"},"PeriodicalIF":1.5,"publicationDate":"2006-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80355953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speculative execution in a distributed file system
Edmund B. Nightingale, Peter M. Chen, J. Flinn
Speculator provides Linux kernel support for speculative execution. It allows multiple processes to share speculative state by tracking causal dependencies propagated through interprocess communication. It guarantees correct execution by preventing speculative processes from externalizing output, for example, sending a network message or writing to the screen, until the speculations on which that output depends have proven to be correct. Speculator improves the performance of distributed file systems by masking I/O latency and increasing I/O throughput. Rather than block during a remote operation, a file system predicts the operation's result, then uses Speculator to checkpoint the state of the calling process and speculatively continue its execution based on the predicted result. If the prediction is correct, the checkpoint is discarded; if it is incorrect, the calling process is restored to the checkpoint, and the operation is retried. We have modified the client, server, and network protocol of two distributed file systems to use Speculator. For PostMark and Andrew-style benchmarks, speculative execution results in a factor of 2 performance improvement for NFS over local area networks and an order of magnitude improvement over wide area networks. For the same benchmarks, Speculator enables the Blue File System to provide the consistency of single-copy file semantics and the safety of synchronous I/O, yet still outperform current distributed file systems with weaker consistency and safety.
{"title":"Speculative execution in a distributed file system","authors":"Edmund B. Nightingale, Peter M. Chen, J. Flinn","doi":"10.1145/1189256.1189258","DOIUrl":"https://doi.org/10.1145/1189256.1189258","url":null,"abstract":"Speculator provides Linux kernel support for speculative execution. It allows multiple processes to share speculative state by tracking causal dependencies propagated through interprocess communication. It guarantees correct execution by preventing speculative processes from externalizing output, for example, sending a network message or writing to the screen, until the speculations on which that output depends have proven to be correct. Speculator improves the performance of distributed file systems by masking I/O latency and increasing I/O throughput. Rather than block during a remote operation, a file system predicts the operation's result, then uses Speculator to checkpoint the state of the calling process and speculatively continue its execution based on the predicted result. If the prediction is correct, the checkpoint is discarded; if it is incorrect, the calling process is restored to the checkpoint, and the operation is retried. We have modified the client, server, and network protocol of two distributed file systems to use Speculator. For PostMark and Andrew-style benchmarks, speculative execution results in a factor of 2 performance improvement for NFS over local area networks and an order of magnitude improvement over wide area networks. For the same benchmarks, Speculator enables the Blue File System to provide the consistency of single-copy file semantics and the safety of synchronous I/O, yet still outperform current distributed file systems with weaker consistency and safety.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"58 1","pages":"361-392"},"PeriodicalIF":1.5,"publicationDate":"2006-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83954425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vigilante: End-to-end containment of Internet worm epidemics
Manuel Costa, J. Crowcroft, M. Castro, A. Rowstron, Lidong Zhou, Lintao Zhang, P. Barham
Worm containment must be automatic because worms can spread too fast for humans to respond. Recent work proposed network-level techniques to automate worm containment; these techniques have limitations because there is no information about the vulnerabilities exploited by worms at the network level. We propose Vigilante, a new end-to-end architecture to contain worms automatically that addresses these limitations. In Vigilante, hosts detect worms by instrumenting vulnerable programs to analyze infection attempts. We introduce dynamic data-flow analysis: a broad-coverage host-based algorithm that can detect unknown worms by tracking the flow of data from network messages and disallowing unsafe uses of this data. We also show how to integrate other host-based detection mechanisms into the Vigilante architecture. Upon detection, hosts generate self-certifying alerts (SCAs), a new type of security alert that can be inexpensively verified by any vulnerable host. Using SCAs, hosts can cooperate to contain an outbreak, without having to trust each other. Vigilante broadcasts SCAs over an overlay network that propagates alerts rapidly and resiliently. Hosts receiving an SCA protect themselves by generating filters with vulnerability condition slicing: an algorithm that performs dynamic analysis of the vulnerable program to identify control-flow conditions that lead to successful attacks. These filters block the worm attack and all its polymorphic mutations that follow the execution path identified by the SCA. Our results show that Vigilante can contain fast-spreading worms that exploit unknown vulnerabilities, and that Vigilante's filters introduce a negligible performance overhead. Vigilante does not require any changes to hardware, compilers, operating systems, or the source code of vulnerable programs; therefore, it can be used to protect current software binaries.
{"title":"Vigilante: End-to-end containment of Internet worm epidemics","authors":"Manuel Costa, J. Crowcroft, M. Castro, A. Rowstron, Lidong Zhou, Lintao Zhang, P. Barham","doi":"10.1145/1455258.1455259","DOIUrl":"https://doi.org/10.1145/1455258.1455259","url":null,"abstract":"Worm containment must be automatic because worms can spread too fast for humans to respond. Recent work proposed network-level techniques to automate worm containment; these techniques have limitations because there is no information about the vulnerabilities exploited by worms at the network level. We propose Vigilante, a new end-to-end architecture to contain worms automatically that addresses these limitations.\u0000 In Vigilante, hosts detect worms by instrumenting vulnerable programs to analyze infection attempts. We introduce dynamic data-flow analysis: a broad-coverage host-based algorithm that can detect unknown worms by tracking the flow of data from network messages and disallowing unsafe uses of this data. We also show how to integrate other host-based detection mechanisms into the Vigilante architecture. Upon detection, hosts generate self-certifying alerts (SCAs), a new type of security alert that can be inexpensively verified by any vulnerable host. Using SCAs, hosts can cooperate to contain an outbreak, without having to trust each other. Vigilante broadcasts SCAs over an overlay network that propagates alerts rapidly and resiliently. Hosts receiving an SCA protect themselves by generating filters with vulnerability condition slicing: an algorithm that performs dynamic analysis of the vulnerable program to identify control-flow conditions that lead to successful attacks. These filters block the worm attack and all its polymorphic mutations that follow the execution path identified by the SCA.\u0000 Our results show that Vigilante can contain fast-spreading worms that exploit unknown vulnerabilities, and that Vigilante's filters introduce a negligible performance overhead. Vigilante does not require any changes to hardware, compilers, operating systems, or the source code of vulnerable programs; therefore, it can be used to protect current software binaries.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"64 1","pages":"9:1-9:68"},"PeriodicalIF":1.5,"publicationDate":"2006-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88762157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy-aware lossless data compression
K. Barr, K. Asanović
Wireless transmission of a single bit can require over 1000 times more energy than a single computation. It can therefore be beneficial to perform additional computation to reduce the number of bits transmitted. If the energy required to compress data is less than the energy required to send it, there is a net energy savings and an increase in battery life for portable computers. This article presents a study of the energy savings possible by losslessly compressing data prior to transmission. A variety of algorithms were measured on a StrongARM SA-110 processor. This work demonstrates that, with several typical compression algorithms, there is actually a net energy increase when compression is applied before transmission. Reasons for this increase are explained and suggestions are made to avoid it. One such energy-aware suggestion is asymmetric compression, the use of one compression algorithm on the transmit side and a different algorithm for the receive path. By choosing the lowest-energy compressor and decompressor on the test platform, overall energy to send and receive data can be reduced by 11% compared with a well-chosen symmetric pair, or up to 57% over the default symmetric zlib scheme.
{"title":"Energy-aware lossless data compression","authors":"K. Barr, K. Asanović","doi":"10.1145/1151690.1151692","DOIUrl":"https://doi.org/10.1145/1151690.1151692","url":null,"abstract":"Wireless transmission of a single bit can require over 1000 times more energy than a single computation. It can therefore be beneficial to perform additional computation to reduce the number of bits transmitted. If the energy required to compress data is less than the energy required to send it, there is a net energy savings and an increase in battery life for portable computers. This article presents a study of the energy savings possible by losslessly compressing data prior to transmission. A variety of algorithms were measured on a StrongARM SA-110 processor. This work demonstrates that, with several typical compression algorithms, there is a actually a net energy increase when compression is applied before transmission. Reasons for this increase are explained and suggestions are made to avoid it. One such energy-aware suggestion is asymmetric compression, the use of one compression algorithm on the transmit side and a different algorithm for the receive path. By choosing the lowest-energy compressor and decompressor on the test platform, overall energy to send and receive data can be reduced by 11% compared with a well-chosen symmetric pair, or up to 57% over the default symmetric zlib scheme.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"16 1","pages":"250-291"},"PeriodicalIF":1.5,"publicationDate":"2006-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90763250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}