We observe a non-negligible fraction--3 to 16% in our benchmarks--of dynamically dead instructions, dynamic instruction instances that generate unused results. The majority of these instructions arise from static instructions that also produce useful results. We find that compiler optimization (specifically instruction scheduling) creates a significant portion of these partially dead static instructions. We show that most of the dynamically dead instructions arise from a small set of static instructions that produce dead values most of the time. We leverage this locality by proposing a dead instruction predictor and presenting a scheme to avoid the execution of predicted-dead instructions. Our predictor achieves an accuracy of 93% while identifying over 91% of the dead instructions using less than 5 KB of state. We achieve such high accuracies by leveraging future control flow information (i.e., branch predictions) to distinguish between useless and useful instances of the same static instruction. We then present a mechanism to avoid the register allocation, instruction scheduling, and execution of predicted dead instructions. We measure reductions in resource utilization averaging over 5% and sometimes exceeding 10%, covering physical register management (allocation and freeing), register file read and write traffic, and data cache accesses. Performance improves by an average of 3.6% on an architecture exhibiting resource contention. Additionally, our scheme frees future compilers from the need to consider the costs of dead instructions, enabling more aggressive code motion and optimization. Simultaneously, it mitigates the need for good path profiling information in making inter-block code motion decisions.
{"title":"Dynamic dead-instruction detection and elimination","authors":"J. A. Butts, G. Sohi","doi":"10.1145/605397.605419","DOIUrl":"https://doi.org/10.1145/605397.605419","url":null,"abstract":"We observe a non-negligible fraction--3 to 16% in our benchmarks--of dynamically dead instructions, dynamic instruction instances that generate unused results. The majority of these instructions arise from static instructions that also produce useful results. We find that compiler optimization (specifically instruction scheduling) creates a significant portion of these partially dead static instructions. We show that most of the dynamically instructions arise from a small set of static instructions that produce dead values most of the time.We leverage this locality by proposing a dead instruction predictor and presenting a scheme to avoid the execution of predicted-dead instructions. Our predictor achieves an accuracy of 93% while identifying over 91% of the dead instructions using less than 5 KB of state. We achieve such high accuracies by leveraging future control flow information (i.e., branch predictions) to distinguish between useless and useful instances of the same static instruction.We then present a mechanism to avoid the register allocation, instruction scheduling, and execution of predicted dead instructions. We measure reductions in resource utilization averaging over 5% and sometimes exceeding 10%, covering physical register management (allocation and freeing), register file read and write traffic, and data cache accesses. Performance improves by an average of 3.6% on an architecture exhibiting resource contention. Additionally, our scheme frees future compilers from the need to consider the costs of dead instructions, enabling more aggressive code motion and optimization. Simultaneously, it mitigates the need for good path profiling information in making inter-block code motion decisions.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125008281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although central processor speeds continue to improve, improvements in overall system performance are increasingly hampered by memory latency, especially for pointer-intensive applications. To counter this loss of performance, numerous data and instruction prefetch mechanisms have been proposed. Recently, several proposals have posited a memory-side prefetcher; typically, these prefetchers involve a distinct processor that executes a program slice that would effectively prefetch data needed by the primary program. Alternative designs embody large state tables that learn the miss reference behavior of the processor and attempt to prefetch likely misses.This paper proposes Content-Directed Data Prefetching, a data prefetching architecture that exploits the memory allocation used by operating systems and runtime systems to improve the performance of pointer-intensive applications constructed using modern language systems. This technique is modeled after conservative garbage collection, and prefetches "likely" virtual addresses observed in memory references. This prefetching mechanism uses the underlying data of the application, and provides an 11.3% speedup using no additional processor state. By adding less than ½% space overhead to the second level cache, performance can be further increased to 12.6% across a range of "real world" applications.
{"title":"A stateless, content-directed data prefetching mechanism","authors":"Robert Cooksey, S. Jourdan, D. Grunwald","doi":"10.1145/605397.605427","DOIUrl":"https://doi.org/10.1145/605397.605427","url":null,"abstract":"Although central processor speeds continues to improve, improvements in overall system performance are increasingly hampered by memory latency, especially for pointer-intensive applications. To counter this loss of performance, numerous data and instruction prefetch mechanisms have been proposed. Recently, several proposals have posited a memory-side prefetcher; typically, these prefetchers involve a distinct processor that executes a program slice that would effectively prefetch data needed by the primary program. Alternative designs embody large state tables that learn the miss reference behavior of the processor and attempt to prefetch likely misses.This paper proposes Content-Directed Data Prefetching, a data prefetching architecture that exploits the memory allocation used by operating systems and runtime systems to improve the performance of pointer-intensive applications constructed using modern language systems. This technique is modeled after conservative garbage collection, and prefetches \"likely\" virtual addresses observed in memory references. This prefetching mechanism uses the underlying data of the application, and provides an 11.3% speedup using no additional processor state. By adding less than ½% space overhead to the second level cache, performance can be further increased to 12.6% across a range of \"real world\" applications.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128078927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tao Li, L. John, A. Sivasubramaniam, N. Vijaykrishnan, J. Rubio
Many modern applications result in a significant operating system (OS) component. The OS component has several implications, including affecting control flow transfer in the execution environment. This paper focuses on understanding operating system effects on control flow transfer and prediction, and on designing architectural support to alleviate the bottlenecks. We characterize the control flow transfer of several emerging applications on a commercial operating system. We find that the exception-driven, intermittent invocation of OS code and the user/OS branch history interference increase mispredictions in both user and kernel code. We propose two simple OS-aware control flow prediction techniques to alleviate the destructive impact of user/OS branch interference. The first one consists of capturing separate branch correlation information for user and kernel code. The second one involves using separate branch prediction tables for user and kernel code. We study the improvement contributed by OS-aware prediction to various branch predictors, ranging from the simple Gshare to the more elegant Agree, Multi-Hybrid, and Bi-Mode predictors. On 32K-entry predictors, incorporating the OS-aware techniques yields up to 34%, 23%, 27%, and 9% prediction accuracy improvement in the Gshare, Multi-Hybrid, Agree, and Bi-Mode predictors, respectively, resulting in up to 8% execution speedup.
{"title":"Understanding and improving operating system effects in control flow prediction","authors":"Tao Li, L. John, A. Sivasubramaniam, N. Vijaykrishnan, J. Rubio","doi":"10.1145/605397.605405","DOIUrl":"https://doi.org/10.1145/605397.605405","url":null,"abstract":"Many modern applications result in a significant operating system (OS) component. The OS component has several implications including affecting the control flow transfer in the execution environment. This paper focuses on understanding the operating system effects on control flow transfer and prediction, and designing architectural support to alleviate the bottlenecks. We characterize the control flow transfer of several emerging applications on a commercial operating system. We find that the exception-driven, intermittent invocation of OS code and the user/OS branch history interference increase the misprediction in both user and kernel code.We propose two simple OS-aware control flow prediction techniques to alleviate the destructive impact of user/OS branch interference. The first one consists of capturing separate branch correlation information for user and kernel code. The second one involves using separate branch prediction tables for user and kernel code. We study the improvement contributed by the OS-aware prediction to various branch predictors ranging from simple Gshare to more elegant Agree, Multi-Hybrid and Bi-Mode predictors. On 32K entries predictors, incorporating OS-aware techniques yields up to 34%, 23%, 27% and 9% prediction accuracy improvement in Gshare, Multi-Hybrid, Agree and Bi-Mode predictors, resulting in up to 8% execution speedup.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114312260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces network interface data caching, a new technique to reduce local interconnect traffic on networking servers by caching frequently-requested content on a programmable network interface. The operating system on the host CPU determines which data to store in the cache and for which packets it should use data from the cache. To facilitate data reuse across multiple packets and connections, the cache only stores application-level response content (such as HTTP data), with application-level and networking headers generated by the host CPU. Network interface data caching can reduce PCI traffic by up to 57% on a prototype implementation of a uniprocessor web server. This traffic reduction results in up to 31% performance improvement, leading to a peak server throughput of 1571 Mb/s.
{"title":"Increasing web server throughput with network interface data caching","authors":"Hyong-youb Kim, Vijay S. Pai, S. Rixner","doi":"10.1145/605397.605423","DOIUrl":"https://doi.org/10.1145/605397.605423","url":null,"abstract":"This paper introduces network interface data caching, a new technique to reduce local interconnect traffic on networking servers by caching frequently-requested content on a programmable network interface. The operating system on the host CPU determines which data to store in the cache and for which packets it should use data from the cache. To facilitate data reuse across multiple packets and connections, the cache only stores application-level response content (such as HTTP data), with application-level and networking headers generated by the host CPU. Network interface data caching can reduce PCI traffic by up to 57% on a prototype implementation of a uniprocessor web server. This traffic reduction results in up to 31% performance improvement, leading to a peak server throughput of 1571 Mb/s.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129849866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy consumption has recently been widely recognized as a major challenge of computer systems design. This paper explores how to support energy as a first-class operating system resource. Energy, because of its global system nature, presents challenges beyond those of conventional resource management. To meet these challenges we propose the Currentcy Model, which unifies energy accounting over diverse hardware components and enables fair allocation of available energy among applications. Our particular goal is to extend battery lifetime by limiting the average discharge rate and to share this limited resource among competing tasks according to user preferences. To demonstrate how our framework supports explicit control over the battery resource, we implemented ECOSystem, a modified Linux, that incorporates our currentcy model. Experimental results show that ECOSystem accurately accounts for the energy consumed by asynchronous device operation, can achieve a target battery lifetime, and proportionally shares the limited energy resource among competing tasks.
{"title":"ECOSystem: managing energy as a first class operating system resource","authors":"Heng Zeng, C. Ellis, A. Lebeck, Amin Vahdat","doi":"10.1145/605397.605411","DOIUrl":"https://doi.org/10.1145/605397.605411","url":null,"abstract":"Energy consumption has recently been widely recognized as a major challenge of computer systems design. This paper explores how to support energy as a first-class operating system resource. Energy, because of its global system nature, presents challenges beyond those of conventional resource management. To meet these challenges we propose the Currentcy Model that unifies energy accounting over diverse hardware components and enables fair allocation of available energy among applications. Our particular goal is to extend battery lifetime by limiting the average discharge rate and to share this limited resource among competing task according to user preferences. To demonstrate how our framework supports explicit control over the battery resource we implemented ECOSystem, a modified Linux, that incorporates our currentcy model. Experimental results show that ECOSystem accurately accounts for the energy consumed by asynchronous device operation, can achieve a target battery lifetime, and proportionally shares the limited energy resource among competing tasks.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132825224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Barriers, locks, and flags are synchronizing operations widely used by programmers and parallelizing compilers to produce race-free parallel programs. Oftentimes, these operations are placed suboptimally, either because of conservative assumptions about the program, or merely for code simplicity. We propose Speculative Synchronization, which applies the philosophy behind Thread-Level Speculation (TLS) to explicitly parallel applications. Speculative threads execute past active barriers, busy locks, and unset flags instead of waiting. The proposed hardware checks for conflicting accesses and, if a violation is detected, the offending speculative thread is rolled back to the synchronization point and restarted on the fly. TLS's principle of always keeping a safe thread is key to our proposal: in any speculative barrier, lock, or flag, the existence of one or more safe threads at all times guarantees forward progress, even in the presence of access conflicts or speculative buffer overflow. Our proposal requires simple hardware and no programming effort. Furthermore, it can coexist with conventional synchronization at run time. We use simulations to evaluate 5 compiler- and hand-parallelized applications. Our results show a reduction in the time lost to synchronization of 34% on average, and a reduction in overall program execution time of 7.4% on average.
{"title":"Speculative synchronization: applying thread-level speculation to explicitly parallel applications","authors":"José F. Martínez, J. Torrellas","doi":"10.1145/605397.605400","DOIUrl":"https://doi.org/10.1145/605397.605400","url":null,"abstract":"Barriers, locks, and flags are synchronizing operations widely used programmers and parallelizing compilers to produce race-free parallel programs. Often times, these operations are placed suboptimally, either because of conservative assumptions about the program, or merely for code simplicity.We propose Speculative Synchronization, which applies the philosophy behind Thread-Level Speculation (TLS) to explicitly parallel applications. Speculative threads execute past active barriers, busy locks, and unset flags instead of waiting. The proposed hardware checks for conflicting accesses and, if a violation is detected, offending speculative thread is rolled back to the synchronization point and restarted on the fly. TLS's principle of always keeping a safe thread is key to our proposal: in any speculative barrier, lock, or flag, the existence of one or more safe threads at all times guarantees forward progress, even in the presence of access conflicts or speculative buffer overflow. Our proposal requires simple hardware and no programming effort. Furthermore, it can coexist with conventional synchronization at run time.We use simulations to evaluate 5 compiler- and hand-parallelized applications. Our results show a reduction in the time lost to synchronization of 34% on average, and a reduction in overall program execution time of 7.4% on average.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"121 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116306446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mondrian memory protection (MMP) is a fine-grained protection scheme that allows multiple protection domains to flexibly share memory and export protected services. In contrast to earlier page-based systems, MMP allows arbitrary permissions control at the granularity of individual words. We use a compressed permissions table to reduce space overheads and employ two levels of permissions caching to reduce run-time overheads. The protection tables in our implementation add less than 9% overhead to the memory space used by the application. Accessing the protection tables adds less than 8% additional memory references to the accesses made by the application. Although it can be layered on top of demand-paged virtual memory, MMP is also well-suited to embedded systems with a single physical address space. We extend MMP to support segment translation, which allows a memory segment to appear at another location in the address space. We use this translation to implement zero-copy networking underneath the standard read system call interface, where packet payload fragments are connected together by the translation system to avoid data copying. This saves 52% of the memory references used by a traditional copying network stack.
{"title":"Mondrian memory protection","authors":"E. Witchel, Josh Cates, K. Asanović","doi":"10.1145/605397.605429","DOIUrl":"https://doi.org/10.1145/605397.605429","url":null,"abstract":"Mondrian memory protection (MMP) is a fine-grained protection scheme that allows multiple protection domains to flexibly share memory and export protected services. In contrast to earlier page-based systems, MMP allows arbitrary permissions control at the granularity of individual words. We use a compressed permissions table to reduce space overheads and employ two levels of permissions caching to reduce run-time overheads. The protection tables in our implementation add less than 9% overhead to the memory space used by the application. Accessing the protection tables adds than 8% additional memory references to the accesses made by the application. Although it can be layered on top of demand-paged virtual memory, MMP is also well-suited to embedded systems with a single physical address space. We extend MMP to support segment translation which allows a memory segment to appear at another location in the address space. We use this translation to implement zero-copy networking underneath the standard read system call interface, where packet payload fragments are connected together by the translation system to avoid data copying. This saves 52% of the memory references used by a traditional copying network stack.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115776302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Sherwood, Erez Perelman, Greg Hamerly, B. Calder
Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ramifications for many architectural and compiler techniques, from thread scheduling, to feedback directed optimizations, to the way programs are simulated. However, in order to take advantage of time-varying behavior, we must first develop the analytical tools necessary to automatically and efficiently analyze program behavior over large sections of execution.Our goal is to develop automatic techniques that are capable of finding and exploiting the Large Scale Behavior of programs (behavior seen over billions of instructions). The first step towards this goal is the development of a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program. To this end we examine the use of Basic Block Vectors. We quantify the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explore the large scale behavior of several programs, and develop a set of algorithms based on clustering capable of analyzing this behavior. We then demonstrate an application of this technology to automatically determine where to simulate for a program to help guide computer architecture research.
{"title":"Automatically characterizing large scale program behavior","authors":"T. Sherwood, Erez Perelman, Greg Hamerly, B. Calder","doi":"10.1145/605397.605403","DOIUrl":"https://doi.org/10.1145/605397.605403","url":null,"abstract":"Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ramifications for many architectural and compiler techniques, from thread scheduling, to feedback directed optimizations, to the way programs are simulated. However, in order to take advantage of time-varying behavior, we must first develop the analytical tools necessary to automatically and efficiently analyze program behavior over large sections of execution.Our goal is to develop automatic techniques that are capable of finding and exploiting the Large Scale Behavior of programs (behavior seen over billions of instructions). The first step towards this goal is the development of a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program. To this end we examine the use of Basic Block Vectors. We quantify the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explore the large scale behavior of several programs, and develop a set of algorithms based on clustering capable of analyzing this behavior. We then demonstrate an application of this technology to automatically determine where to simulate for a program to help guide computer architecture research.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126712365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This non-uniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provides fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy--while using less silicon area--by 13%, and comes within 13% of an ideal minimal hit latency solution.
{"title":"An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches","authors":"Changkyu Kim, D. Burger, S. Keckler","doi":"10.1145/605397.605420","DOIUrl":"https://doi.org/10.1145/605397.605420","url":null,"abstract":"Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This non-uniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provides fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy--while using less silicon area--by 13%, and comes within 13% of an ideal minimal hit latency solution.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127583499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael I. Gordon, W. Thies, M. Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, J.S.S.M. Wong, H. Hoffmann, David Maze, Saman P. Amarasinghe
With the increasing miniaturization of transistors, wire delays are becoming a dominant factor in microprocessor performance. To address this issue, a number of emerging architectures contain replicated processing units with software-exposed communication between one unit and another (e.g., Raw, SmartMemories, TRIPS). However, for their use to be widespread, it will be necessary to develop compiler technology that enables a portable, high-level language to execute efficiently across a range of wire-exposed architectures.In this paper, we describe our compiler for StreamIt: a high-level, architecture-independent language for streaming applications. We focus on our backend for the Raw processor. Though StreamIt exposes the parallelism and communication patterns of stream programs, some analysis is needed to adapt a stream program to a software-exposed processor. We describe a partitioning algorithm that employs fission and fusion transformations to adjust the granularity of a stream graph, a layout algorithm that maps a stream graph to a given network topology, and a scheduling strategy that generates a fine-grained static communication pattern for each computational element.We have implemented a fully functional compiler that parallelizes StreamIt applications for Raw, including several load-balancing transformations. Using the cycle-accurate Raw simulator, we demonstrate that the StreamIt compiler can automatically map a high-level stream abstraction to Raw without losing performance. We consider this work to be a first step towards a portable programming model for communication-exposed architectures.
{"title":"A stream compiler for communication-exposed architectures","authors":"Michael I. Gordon, W. Thies, M. Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, J.S.S.M. Wong, H. Hoffmann, David Maze, Saman P. Amarasinghe","doi":"10.1145/605397.605428","DOIUrl":"https://doi.org/10.1145/605397.605428","url":null,"abstract":"With the increasing miniaturization of transistors, wire delays are becoming a dominant factor in microprocessor performance. To address this issue, a number of emerging architectures contain replicated processing units with software-exposed communication between one unit and another (e.g., Raw, SmartMemories, TRIPS). However, for their use to be widespread, it will be necessary to develop compiler technology that enables a portable, high-level language to execute efficiently across a range of wire-exposed architectures.In this paper, we describe our compiler for StreamIt: a high-level, architecture-independent language for streaming applications. We focus on our backend for the Raw processor. Though StreamIt exposes the parallelism and communication patterns of stream programs, some analysis is needed to adapt a stream program to a software-exposed processor. We describe a partitioning algorithm that employs fission and fusion transformations to adjust the granularity of a stream graph, a layout algorithm that maps a stream graph to a given network topology, and a scheduling strategy that generates a fine-grained static communication pattern for each computational element.We have implemented a fully functional compiler that parallelizes StreamIt applications for Raw, including several load-balancing transformations. Using the cycle-accurate Raw simulator, we demonstrate that the StreamIt compiler can automatically map a high-level stream abstraction to Raw without losing performance. We consider this work to be a first step towards a portable programming model for communication-exposed architectures.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117197000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}