
Latest Publications in ASPLOS XI

Scalable selective re-execution for EDGE architectures
Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024408
R. Desikan, S. Sethumadhavan, D. Burger, S. Keckler
Pipeline flushes are becoming increasingly expensive in modern microprocessors with large instruction windows and deep pipelines. Selective re-execution is a technique that can reduce the penalty of mis-speculations by re-executing only instructions affected by the mis-speculation, instead of all instructions. In this paper we introduce a new selective re-execution mechanism that exploits the properties of a dataflow-like Explicit Data Graph Execution (EDGE) architecture to support efficient mis-speculation recovery, while scaling to window sizes of thousands of instructions with high performance. This distributed selective re-execution (DSRE) protocol permits multiple speculative waves of computation to be traversing a dataflow graph simultaneously, with a commit wave propagating behind them to ensure correct execution. We evaluate one application of this protocol to provide efficient recovery for load-store dependence speculation. Unlike traditional dataflow architectures which resorted to single-assignment memory semantics, the DSRE protocol combines dataflow execution with speculation to enable high performance and conventional sequential memory semantics. Our experiments show that the DSRE protocol results in an average 17% speedup over the best dependence predictor proposed to date, and obtains 82% of the performance possible with a perfect oracle directing the issue of loads.
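The wave mechanism is easiest to see on a toy dataflow graph: when a mis-speculated value is corrected, only its transitive consumers re-fire, and the wave dies out wherever a recomputed value is unchanged. Below is a minimal C sketch of that idea; the add-only node semantics, the graph encoding, and the recursive firing order are illustrative assumptions, not the paper's EDGE implementation.

```c
/* Toy selective re-execution: nodes re-fire only when an input changed. */
#include <stdio.h>

#define N 6

/* inputs[i] = the two producer nodes of node i, or {-1,-1} for a leaf */
static int inputs[N][2] = {
    {-1, -1}, {-1, -1},              /* node 0: a load; node 1: a constant */
    {0, 1}, {2, 1}, {2, 0}, {3, 4}
};
static long value[N];

/* Fire node i; if its output changed, wake its consumers. This is the
 * re-execution "wave": it stops wherever a value is unaffected. */
static void fire(int i)
{
    if (inputs[i][0] >= 0) {
        long v = value[inputs[i][0]] + value[inputs[i][1]];
        if (v == value[i])
            return;                  /* unaffected: wave dies here */
        value[i] = v;
    }
    for (int j = 0; j < N; j++)      /* wake every consumer of node i */
        if (inputs[j][0] == i || inputs[j][1] == i)
            fire(j);
}

int main(void)
{
    value[0] = 10;                   /* speculative load value */
    value[1] = 3;
    fire(0); fire(1);                /* initial speculative wave */
    printf("speculative result: %ld\n", value[N - 1]);

    value[0] = 42;                   /* load mis-speculated; corrected value arrives */
    fire(0);                         /* only node 0's consumers re-execute */
    printf("recovered result:   %ld\n", value[N - 1]);
    return 0;
}
```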
Citations: 16
Coherence decoupling: making use of incoherence
Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024406
Jaehyuk Huh, Jichuan Chang, D. Burger, G. Sohi
This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce (if not eliminate) the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness. The performance benefits of coherence decoupling are evaluated using a full-system simulator and a mix of commercial and scientific benchmarks. Our results show that 40% to 90% of all coherence misses can be speculated correctly, and therefore their latencies are partially or fully hidden. This capability results in performance improvements ranging from 3% to over 16%, in most cases where the latencies of coherence misses have an effect on performance.
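A toy model of the decoupling: the SCL side hands back whatever payload an invalid line still holds so the pipeline keeps computing, while the backing protocol fetches the coherent value for later verification. The line layout, state names, and squash-by-recompute step below are illustrative assumptions, not the paper's protocol.

```c
/* Sketch of a Speculative Cache Lookup: read a stale value from an
 * invalid line, compute with it, then verify against the coherent value. */
#include <stdio.h>
#include <stdbool.h>

typedef enum { MODIFIED, SHARED, INVALID } state_t;

typedef struct {
    state_t state;
    long    data;      /* stale payload survives invalidation */
} cache_line_t;

/* SCL protocol: return the (possibly stale) value immediately. */
static long scl_read(const cache_line_t *line, bool *speculative)
{
    *speculative = (line->state == INVALID);
    return line->data;
}

int main(void)
{
    cache_line_t line = { INVALID, 7 };  /* invalidated by another core */

    bool spec;
    long v = scl_read(&line, &spec);     /* pipeline keeps going */
    long result = v * 2;                 /* compute with incoherent data */

    long coherent = 7;                   /* value the backing protocol fetches;
                                            under false sharing it usually
                                            equals the stale one */
    if (spec && v != coherent) {
        result = coherent * 2;           /* mis-speculation: squash, re-execute */
        printf("squash: stale=%ld coherent=%ld\n", v, coherent);
    } else {
        printf("speculation verified, latency hidden: result=%ld\n", result);
    }
    return 0;
}
```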
Citations: 77
An ultra low-power processor for sensor networks
Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024397
Virantha N. Ekanayake, Clinton Kelly IV, R. Manohar
We present a novel processor architecture designed specifically for use in low-power wireless sensor-network nodes. Our sensor network asynchronous processor (SNAP/LE) is based on an asynchronous data-driven 16-bit RISC core with an extremely low-power idle state, and a wakeup response latency on the order of tens of nanoseconds. The processor instruction set is optimized for sensor-network applications, with support for event scheduling, pseudo-random number generation, bitfield operations, and radio/sensor interfaces. SNAP/LE has a hardware event queue and event coprocessors, which allow the processor to avoid the overhead of operating system software (such as task schedulers and external interrupt servicing), while still providing a straightforward programming interface to the designer. The processor can meet performance levels required for data monitoring applications while executing instructions with tens of picojoules of energy. We evaluate the energy consumption of SNAP/LE with several applications representative of the workload found in data-gathering wireless sensor networks. We compare our architecture and software against existing platforms for sensor networks, quantifying both the software and hardware benefits of our approach.
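The event-driven model can be pictured as a dispatch loop over a timestamped queue, with the core idling between events. In SNAP/LE the queue and dispatch are hardware; the handler names and queue layout in this sketch are invented for illustration.

```c
/* Sketch of event-driven execution over a timestamped event queue. */
#include <stdio.h>

#define QCAP 8

typedef struct { unsigned long when; void (*handler)(void); } event_t;

static event_t queue[QCAP];
static int nevents;

static void enqueue(unsigned long when, void (*h)(void))
{
    queue[nevents++] = (event_t){ when, h };
}

static void sample_sensor(void) { puts("sample sensor"); }
static void send_radio(void)    { puts("send radio packet"); }

int main(void)
{
    enqueue(300, send_radio);
    enqueue(100, sample_sensor);

    /* Dispatch loop: pick the earliest event; between events the core
     * would sit in its near-zero-power idle state. */
    while (nevents > 0) {
        int next = 0;
        for (int i = 1; i < nevents; i++)
            if (queue[i].when < queue[next].when) next = i;
        event_t e = queue[next];
        queue[next] = queue[--nevents];
        /* ...idle until time e.when, then wake in tens of ns... */
        e.handler();
    }
    return 0;
}
```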
Citations: 137
Compiler orchestrated prefetching via speculation and predication
Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024416
R. Rabbah, Hariharan Sandanagobalane, M. Ekpanyapong, W. Wong
This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program dependence graph that succinctly characterize the memory access pattern of both regular array-based applications and irregular pointer-intensive programs. We illustrate how program embedded precomputation via speculative execution can accurately predict and effectively prefetch future memory references with negligible overhead. The proposed techniques reduce the total running time of seven SPEC benchmarks and two OLDEN benchmarks by 27% on an Itanium 2 processor. The improvements are in addition to several state-of-the-art optimizations including software pipelining and data prefetching. In addition, we use cycle-accurate simulations to identify important and lightweight architectural innovations that further mitigate the memory system bottleneck. In particular, we focus on the notoriously challenging class of pointer-chasing applications, and demonstrate how they may benefit from a novel scheme of sentineled prefetching. Our results for twelve SPEC benchmarks demonstrate that 45% of the processor stalls that are caused by the memory system are avoidable. The techniques in this paper can effectively mask long memory latencies with little instruction overhead, and can readily contribute to the performance of processors today.
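For the pointer-chasing case, the precomputation idea amounts to a distilled slice of the loop (only the ->next chases) running a few nodes ahead of the main computation and issuing prefetches. The sketch below hand-writes such a slice using the GCC/Clang __builtin_prefetch intrinsic; the ahead-distance of 4 is an arbitrary knob, and in the paper's framework the slice would be compiler-generated rather than hand-written.

```c
/* Sketch of precomputation-driven prefetching for a linked-list walk. */
#include <stdio.h>

typedef struct node { long payload; struct node *next; } node_t;

static long sum_list(node_t *head)
{
    node_t *ahead = head;
    for (int i = 0; i < 4 && ahead; i++)    /* start the slice 4 nodes ahead */
        ahead = ahead->next;

    long sum = 0;
    for (node_t *p = head; p; p = p->next) {
        if (ahead) {                        /* speculative slice: chases ->next only */
            __builtin_prefetch(ahead, 0 /* read */, 1 /* low temporal locality */);
            ahead = ahead->next;
        }
        sum += p->payload;                  /* main computation */
    }
    return sum;
}

int main(void)
{
    node_t nodes[16];
    for (int i = 0; i < 16; i++) {
        nodes[i].payload = i;
        nodes[i].next = (i + 1 < 16) ? &nodes[i + 1] : NULL;
    }
    printf("sum = %ld\n", sum_list(nodes));
    return 0;
}
```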
Citations: 50
Programming with transactional coherence and consistency (TCC)
Pub Date : 2004-10-07 DOI: 10.1145/1037949.1024395
Lance Hammond, Brian D. Carlstrom, Vicky Wong, Ben Hertzberg, Michael K. Chen, C. Kozyrakis, K. Olukotun
Transactional Coherence and Consistency (TCC) offers a way to simplify parallel programming by executing all code within transactions. In TCC systems, transactions serve as the fundamental unit of parallel work, communication and coherence. As each transaction completes, it writes all of its newly produced state to shared memory atomically, while restarting other processors that have speculatively read stale data. With this mechanism, a TCC-based system automatically handles data synchronization correctly, without programmer intervention. To gain the benefits of TCC, programs must be decomposed into transactions. We describe two basic programming language constructs for decomposing programs into transactions, a loop conversion syntax and a general transaction-forking mechanism. With these constructs, writing correct parallel programs requires only small, incremental changes to correct sequential programs. The performance of these programs may then easily be optimized, based on feedback from real program execution, using a few simple techniques.
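The abstract names the two constructs but not their syntax, so the sketch below uses hypothetical T_FOR and T_FORK macros with a trivial sequential expansion, purely to show the programming model: the loop iterations race on hist[], a conflict TCC hardware would detect at commit time and resolve by re-executing the later transaction, with no locks in the source.

```c
#include <stdio.h>

/* Hypothetical stand-ins for TCC's loop-conversion syntax and
 * transaction fork; they expand to plain sequential C so the example
 * runs anywhere. On TCC hardware each iteration (and each fork) would
 * run as a speculative transaction and commit its writes atomically. */
#define T_FOR(init, cond, step)  for (init; cond; step)
#define T_FORK(...)              do { __VA_ARGS__; } while (0)

int main(void)
{
    long hist[4] = {0};
    int  data[8] = {1, 3, 0, 2, 3, 3, 1, 0};

    /* Loop conversion: iterations may conflict on hist[]; TCC would
     * detect the conflict at commit and re-execute the later
     * transaction, so no locks or atomics appear in the program. */
    T_FOR(int i = 0, i < 8, i++)
        hist[data[i]]++;

    /* Transaction fork: spawn a dependent transaction (run inline here). */
    T_FORK(printf("hist: %ld %ld %ld %ld\n",
                  hist[0], hist[1], hist[2], hist[3]));
    return 0;
}
```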
Citations: 145
HOIST: a system for automatically deriving static analyzers for embedded systems
Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024410
J. Regehr, A. Reid
Embedded software must meet conflicting requirements such as being highly reliable, running on resource-constrained platforms, and being developed rapidly. Static program analysis can help meet all of these goals. People developing analyzers for embedded object code face a difficult problem: writing an abstract version of each instruction in the target architecture(s). This is currently done by hand, resulting in abstract operations that are both buggy and imprecise. We have developed Hoist: a novel system that solves these problems by automatically constructing abstract operations using a microprocessor (or simulator) as its own specification. With almost no input from a human, Hoist generates a collection of C functions that are ready to be linked into an abstract interpreter. We demonstrate that Hoist generates abstract operations that are correct, having been extensively tested, sufficiently fast, and substantially more precise than manually written abstract operations. Hoist is currently limited to eight-bit machines due to costs exponential in the word size of the target architecture. It is essential to be able to analyze software running on these small processors: they are important and ubiquitous, with many embedded and safety-critical systems being based on them.
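The "processor as its own specification" idea can be reproduced at toy scale: instead of hand-writing an abstract ADD, enumerate every concretization of the abstract inputs and run the concrete operation. The interval domain and 4-bit width below are illustrative simplifications (Hoist targets eight-bit machines and richer abstract domains), but the derived operation is sound by construction.

```c
/* Derive an abstract 4-bit ADD by exhaustively running the concrete one. */
#include <stdio.h>

typedef struct { unsigned lo, hi; } interval_t;   /* [lo, hi], 4-bit values */

static unsigned concrete_add(unsigned a, unsigned b)
{
    return (a + b) & 0xF;             /* the "processor as its own spec" */
}

/* Abstract ADD, derived by enumeration rather than written by hand. */
static interval_t abstract_add(interval_t x, interval_t y)
{
    interval_t r = { 0xF, 0 };        /* start "empty": lo=max, hi=min */
    for (unsigned a = x.lo; a <= x.hi; a++)
        for (unsigned b = y.lo; b <= y.hi; b++) {
            unsigned c = concrete_add(a, b);
            if (c < r.lo) r.lo = c;
            if (c > r.hi) r.hi = c;
        }
    return r;
}

int main(void)
{
    interval_t x = { 2, 3 }, y = { 4, 6 };
    interval_t r = abstract_add(x, y);
    printf("[%u,%u] + [%u,%u] = [%u,%u]\n", x.lo, x.hi, y.lo, y.hi, r.lo, r.hi);

    interval_t w = abstract_add((interval_t){ 14, 15 }, (interval_t){ 1, 2 });
    printf("wraparound case: [%u,%u]\n", w.lo, w.hi);  /* sums wrap, widens to [0,15] */
    return 0;
}
```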
Citations: 53
Devirtualizable virtual machines enabling general, single-node, online maintenance
Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024419
David E. Lowell, Yasushi Saito, Eileen J. Samberg
Maintenance is the dominant source of downtime at high availability sites. Unfortunately, the dominant mechanism for reducing this downtime, cluster rolling upgrade, has two shortcomings that have prevented its broad acceptance. First, cluster-style maintenance over many nodes is typically performed a few nodes at a time, making maintenance slow and often impractical. Second, cluster-style maintenance does not work on single-node systems, despite the fact that their unavailability during maintenance can be painful for organizations. In this paper, we propose a novel technique for online maintenance that uses virtual machines to provide maintenance on single nodes, allowing parallel maintenance over multiple nodes, and online maintenance for standalone servers. We present the Microvisor, our prototype virtual machine system that is custom tailored to the needs of online maintenance. Unlike general purpose virtual machine environments that induce continual 10–20% overhead, the Microvisor virtualizes the hardware only during periods of active maintenance, letting the guest OS run at full speed most of the time. Unlike past attempts at virtual machine optimization, we do not compromise OS transparency. We instead give up generality and tailor our virtual machine system to the minimum needs of online maintenance, eschewing features, such as I/O and memory virtualization, that it does not strictly require. The result is a very thin virtual machine system that induces only 5.6% CPU overhead when virtualizing the hardware, and zero CPU overhead when devirtualized. Using the Microvisor, we demonstrate an online OS upgrade on a live, single-node web server, reducing downtime from one hour to less than one minute.
Citations: 91
D-SPTF: decentralized request distribution in brick-based storage systems
Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024399
Christopher R. Lumb, Richard A. Golding
Distributed Shortest-Positioning Time First (D-SPTF) is a request distribution protocol for decentralized systems of storage servers. D-SPTF exploits high-speed interconnects to dynamically select which server, among those with a replica, should service each read request. In doing so, it simultaneously balances load, exploits the aggregate cache capacity, and reduces positioning times for cache misses. For network latencies expected in storage clusters (e.g., 10–200μs), D-SPTF performs as well as would a hypothetical centralized system with the same collection of CPU, cache, and disk resources. Compared to popular decentralized approaches, D-SPTF achieves up to 65% higher throughput and adapts more cleanly to heterogeneous server capabilities.
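The selection rule is simple to state: every server holding a replica estimates its own positioning time (effectively zero on a cache hit) and the shortest estimate claims the read. The sketch below computes the winner centrally with a toy cost model; in the real protocol each server evaluates this locally after the request is multicast.

```c
/* Sketch of the D-SPTF selection rule with a toy positioning-time model. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    const char *name;
    bool        cached;       /* block already in this server's cache? */
    double      seek_ms;      /* estimated seek + rotational delay otherwise */
} server_t;

static double positioning_time(const server_t *s)
{
    return s->cached ? 0.0 : s->seek_ms;
}

int main(void)
{
    server_t replicas[] = {
        { "brick-a", false, 6.2 },
        { "brick-b", true,  4.0 },   /* cache hit wins regardless of seek */
        { "brick-c", false, 3.1 },
    };
    int n = sizeof replicas / sizeof replicas[0];

    int winner = 0;
    for (int i = 1; i < n; i++)
        if (positioning_time(&replicas[i]) < positioning_time(&replicas[winner]))
            winner = i;

    printf("%s services the read (%.1f ms)\n",
           replicas[winner].name, positioning_time(&replicas[winner]));
    return 0;
}
```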
Citations: 38
Spatial computation
Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024396
M. Budiu, Girish Venkataramani, Tiberiu Chelcea, S. Goldstein
This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units. In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration; the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient. In this work we demonstrate three features of ASH: (1) that such architectures can be built by automatic compilation of C programs; (2) that distributed computation is in some respects fundamentally different from monolithic superscalar processors; and (3) that ASIC implementations of ASH use three orders of magnitude less energy compared to high-end superscalar processors, while being on average only 33% slower in performance (3.5x worst-case).
Citations: 150
Secure program execution via dynamic information flow tracking
Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024404
Edward Suh, Jaewook Lee, Srini Devadas, David Zhang
We present a simple architectural mechanism called dynamic information flow tracking that can significantly improve the security of computing systems with negligible performance overhead. Dynamic information flow tracking protects programs against malicious software attacks by identifying spurious information flows from untrusted I/O and restricting the usage of the spurious information. Every security attack to take control of a program needs to transfer the program's control to malevolent code. In our approach, the operating system identifies a set of input channels as spurious, and the processor tracks all information flows from those inputs. A broad range of attacks are effectively defeated by checking the use of the spurious values as instructions and pointers. Our protection is transparent to users or application programmers; the executables can be used without any modification. Also, our scheme only incurs, on average, a memory overhead of 1.4% and a performance overhead of 1.1%.
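The mechanism reduces to shadow state: every value carries a taint bit that is set on data from spurious input channels, ALU operations OR together the taints of their sources, and using a tainted value as a jump target (or pointer) traps. A minimal register-file sketch, with invented names and a toy trap in place of the hardware exception:

```c
/* Toy dynamic information flow tracking over a tagged register file. */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

typedef struct { long val; bool tainted; } reg_t;

static reg_t reg_add(reg_t a, reg_t b)
{
    return (reg_t){ a.val + b.val, a.tainted || b.tainted };  /* taint propagates */
}

static void jump_indirect(reg_t target)
{
    if (target.tainted) {                    /* spurious value used as a pointer */
        fprintf(stderr, "security trap: jump target derived from untrusted I/O\n");
        exit(1);
    }
    printf("jump to %#lx\n", (unsigned long)target.val);
}

int main(void)
{
    reg_t base    = { 0x400000, false };     /* trusted code address */
    reg_t payload = { 0x41414141, true };    /* came from a spurious input channel */

    jump_indirect(base);                     /* fine: untainted */
    jump_indirect(reg_add(base, payload));   /* trapped: taint flowed through the add */
    return 0;
}
```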
Citations: 830