2008 International Conference on Parallel Architectures and Compilation Techniques (PACT): Latest Publications


COMIC: A coherent shared memory interface for Cell BE

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)

Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454157

Jaejin Lee, Sangmin Seo, C. Kim, Junghyun Kim, Posung Chun, Zehra Sura, Jungwon Kim, Sang-Yong Han

The Cell BE processor is a heterogeneous multicore that contains one PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). Each SPE has a small software-managed local store. Applications must explicitly control all DMA transfers of code and data between the SPE local stores and the main memory, and they must perform any coherence actions required for data transferred. The need for explicit memory management, together with the limited size of the SPE local stores, makes it challenging to program the Cell BE and achieve high performance. In this paper, we present the design and implementation of our COMIC runtime system and its programming model. It provides the program with an illusion of a globally shared memory, in which the PPE and each of the SPEs can access any shared data item, without the programmer having to worry about where the data is, or how to obtain it. COMIC is implemented entirely in software with the aid of user-level libraries provided by the Cell SDK. For each read or write operation in SPE code, a COMIC runtime function is inserted to check whether the data is available in its local store, and to automatically fetch it if it is not. We propose a memory consistency model and a programming model for COMIC, in which the management of synchronization and coherence is centralized in the PPE. To characterize the effectiveness of the COMIC runtime system, we evaluate it with twelve OpenMP benchmark applications on a Cell BE system and an SMP-like homogeneous multicore (Xeon).
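
The per-access check described in the abstract can be pictured as a software-managed cache sitting in front of main memory. The following is a minimal Python sketch of that idea only, not COMIC's actual runtime: the class name `LocalStoreCache`, the line size, the direct-mapped layout, and the write-back policy are all assumptions made for illustration, and a `bytearray` stands in for main memory and the DMA transfers.

```python
class LocalStoreCache:
    """Toy direct-mapped software cache standing in for an SPE local store.

    Assumes len(main_memory) is a multiple of line_size.
    """

    def __init__(self, main_memory, num_lines=4, line_size=16):
        self.mem = main_memory            # bytearray playing the role of main memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines    # which memory line each slot currently holds
        self.data = [bytearray(line_size) for _ in range(num_lines)]
        self.dirty = [False] * num_lines
        self.fetches = 0                  # counts simulated DMA reads

    def _write_back(self, slot):
        # Copy a modified line back to main memory before it is replaced.
        if self.tags[slot] is not None and self.dirty[slot]:
            base = self.tags[slot] * self.line_size
            self.mem[base:base + self.line_size] = self.data[slot]
        self.dirty[slot] = False

    def _lookup(self, addr):
        # The check inserted before every access: is the line already local?
        line_no, offset = divmod(addr, self.line_size)
        slot = line_no % self.num_lines
        if self.tags[slot] != line_no:    # miss: fetch the line automatically
            self._write_back(slot)
            base = line_no * self.line_size
            self.data[slot][:] = self.mem[base:base + self.line_size]
            self.tags[slot] = line_no
            self.fetches += 1
        return slot, offset

    def read(self, addr):
        slot, offset = self._lookup(addr)
        return self.data[slot][offset]

    def write(self, addr, value):
        slot, offset = self._lookup(addr)
        self.data[slot][offset] = value
        self.dirty[slot] = True

    def flush(self):
        # Push all modified lines back, e.g. at a synchronization point.
        for slot in range(self.num_lines):
            self._write_back(slot)


memory = bytearray(range(128))
cache = LocalStoreCache(memory)
first = cache.read(5)     # miss: the line holding address 5 is fetched
second = cache.read(6)    # hit: same line, no further fetch
cache.write(5, 200)       # modifies the local copy only
cache.flush()             # main memory now sees the update
```

The explicit `flush` reflects the assumption that modified data becomes visible to other processors only at well-defined points; the abstract states that synchronization and coherence are managed centrally by the PPE but does not give the protocol.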


Citations: 44

Scalable and reliable communication for hardware transactional memory


Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454137

Seth H. Pugsley, M. Awasthi, Niti Madan, Naveen Muralimanohar, R. Balasubramonian

In a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. This is especially true for a large-scale distributed memory system where multiple transactions may attempt to commit simultaneously and co-ordination is required before allowing commits to proceed in parallel. In this paper, we propose novel algorithms to implement commit that are more scalable in terms of delay and are free of deadlocks/livelocks. We show that these algorithms have similarities with the token cache coherence concept and leverage these similarities to extend the algorithms to handle message loss and starvation scenarios. The proposed algorithms improve upon the state-of-the-art by yielding up to a 7X reduction in commit delay and up to a 48X reduction in network messages for commit. These translate into overall performance improvements of up to 66% (for synthetic workloads with average transaction length of 200 cycles), 35% (for average transaction length of 1000 cycles), and 8% (for average transaction length of 4000 cycles). For a small group of multi-threaded programs with frequent transaction commits, improvements of up to 8% were observed for a 32-node simulation.


Citations: 40

Visualizing potential parallelism in sequential programs


Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454129

Graham D. Price, John Giacomoni, Manish Vachharajani

This paper presents ParaMeter, an interactive program analysis and visualization system for large traces. Using ParaMeter, a software developer can locate and analyze regions of code that may yield to parallelization efforts, and can possibly extract performance from multicore hardware. The key contributions in the paper are (1) a method to use interactive visualization of traces to find and exploit parallelism, (2) interactive-speed visualization of large-scale trace dependencies, (3) interactive-speed visualization of code interactions, and (4) a BDD variable ordering for BDD-compressed traces that results in fast visualization, fast analysis, and good compression. ParaMeter's effectiveness is demonstrated by finding and exploiting parallelism in 175.vpr. Measurements of ParaMeter's visualization algorithms show that they are up to seventy-five thousand times faster than prior approaches.


Citations: 15

Multi-mode energy management for multi-tier server clusters


Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454153

T. Horvath, K. Skadron

This paper presents an energy management policy for reconfigurable clusters running a multi-tier application, exploiting DVS together with multiple sleep states. We develop a theoretical analysis of the corresponding power optimization problem and design an algorithm around the solution. Moreover, we rigorously investigate selection of the optimal number of spare servers for each power state, a problem that has only been approached in an ad-hoc manner in current policies. To validate our results and policies, we implement them on an actual multi-tier server cluster where nodes support all power management techniques considered. Experimental results using realistic dynamic workloads based on the TPC-W benchmark show that exploiting multiple sleep states results in significant additional cluster-wide energy savings up to 23% with little or no performance degradation.


Citations: 124

Feature selection and policy optimization for distributed instruction placement using reinforcement learning


Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454122

Katherine E. Coons, Behnam Robatmili, Matthew E. Taylor, Bertrand A. Maher, D. Burger, K. McKinley

Communication overheads are one of the fundamental challenges in a multiprocessor system. As the number of processors on a chip increases, communication overheads and the distribution of computation and data become increasingly important performance factors. Explicit Dataflow Graph Execution (EDGE) processors, in which instructions communicate with one another directly on a distributed substrate, give the compiler control over communication overheads at a fine granularity. Prior work shows that compilers can effectively reduce fine-grained communication overheads in EDGE architectures using a spatial instruction placement algorithm with a heuristic-based cost function. While this algorithm is effective, the cost function must be painstakingly tuned. Heuristics tuned to perform well across a variety of applications leave users with little ability to tune performance-critical applications, yet we find that the best placement heuristics vary significantly with the application. First, we suggest a systematic feature selection method that reduces the feature set size based on the extent to which features affect performance. To automatically discover placement heuristics, we then use these features as input to a reinforcement learning technique, called Neuro-Evolution of Augmenting Topologies (NEAT), that uses a genetic algorithm to evolve neural networks. We show that NEAT outperforms simulated annealing, the most commonly used optimization technique for instruction placement. We use NEAT to learn general heuristics that are as effective as hand-tuned heuristics, but we find that improving over highly hand-tuned general heuristics is difficult. We then suggest a hierarchical approach to machine learning that classifies segments of code with similar characteristics and learns heuristics for these classes. This approach performs closer to the specialized heuristics. 
Together, these results suggest that learning compiler heuristics may benefit from both improved feature selection and classification.


Citations: 29

Exploiting loop-dependent Stream Reuse for stream processors


Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454121

Xuejun Yang, Y. Zhang, Jingling Xue, Ian Rogers, Gen Li, Guibin Wang

The memory access limits the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip storage, the number of memory accesses can be reduced. In current stream compilers reuse is only attempted for simple stream references, those whose start and end are known. Compiler analysis from outside of stream processors does not directly enable the consideration of other complex stream references. In this paper we propose a transformation to automatically optimize stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: algorithms to recognize the reuse supplied by stream references, a new abstract expression called the Stream Reuse Graph (SRG) to depict the reuse and the optimization of the SRG for the transformation. Both the reuse between whole sequences accessed by stream references and that between partial sequences are exploited in the paper. In particular, the problem of exploiting partial stream reuse does not have its parallel in the traditional data reuse exploitation setting (for scalars and arrays). Finally, we have implemented our techniques using the StreamC/KernelC compiler for Imagine. Experimental results show a resultant speedup of 1.14 to 2.54 times using a range of typical stream processing application kernels.


Citations: 7

Characterizing and modeling the behavior of context switch misses!


Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454130

Fang Liu, Fei Guo, Yan Solihin, Seongbeom Kim, A. Eker

One of the essential features in modern computer systems is context switching, which allows multiple threads of execution to time-share a limited number of processors. While very useful, context switching can introduce high performance overheads, with one of the primary reasons being the cache perturbation effect. Between the time a thread is switched out and when it resumes execution, parts of its working set in the cache may be perturbed by other interfering threads, leading to (context switch) cache misses to recover from the perturbation. The goal of this paper is to understand how cache parameters and application behavior influence the number of context switch misses the application suffers from. We characterize a previously-unreported type of context switch misses that occur as the artifact of the interaction of cache replacement policy and an application's temporal reuse behavior. We characterize the behavior of these “reordered misses” for various applications, cache sizes, and the amount of cache perturbation. As a second contribution, we develop an analytical model that reveals the mathematical relationship between cache design parameters, an application's temporal reuse pattern, and the number of context switch misses the application suffers from. We validate the model against simulation studies and find that it is accurate in predicting the trends of context switch misses. The mathematical relationship provided by the model allows us to derive insights into precisely why some applications are more vulnerable to context switch misses than others. Through a case study, we also find that prefetching tends to aggravate the number of context switch misses.
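
The cache perturbation effect can be reproduced in a few lines. The sketch below is a toy model written for this listing, not the paper's simulator or its analytical model: it assumes a fully associative LRU cache, a single context switch, and a hypothetical helper `count_misses` whose parameters (`switch_at`, `intruder_lines`) are invented for illustration.

```python
from collections import OrderedDict


def count_misses(trace, capacity, switch_at=None, intruder_lines=0):
    """Simulate a fully associative LRU cache and return the misses for `trace`.

    If `switch_at` is given, `intruder_lines` distinct lines belonging to
    another thread are touched just before access number `switch_at`.
    """
    cache = OrderedDict()   # keys ordered from LRU (first) to MRU (last)
    misses = 0

    def touch(line):
        hit = line in cache
        if hit:
            cache.move_to_end(line)           # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)     # evict the LRU line
            cache[line] = True
        return hit

    for i, line in enumerate(trace):
        if switch_at is not None and i == switch_at:
            for k in range(intruder_lines):   # the interfering thread runs here
                touch(("intruder", k))
        if not touch(line):
            misses += 1
    return misses


trace = [i % 6 for i in range(24)]            # one thread cycling over 6 lines
baseline = count_misses(trace, capacity=8)
perturbed = count_misses(trace, capacity=8, switch_at=12, intruder_lines=4)
context_switch_misses = perturbed - baseline  # extra misses caused by the switch
```

In this run the four intruder lines directly evict only two of the thread's six lines, yet the thread goes on to miss on all six. The other four lines survive the switch but have been pushed toward the LRU end, so the thread's own refills evict them before they are reused. That is the kind of interaction between replacement policy and temporal reuse that the abstract describes for reordered misses.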


Citations: 54

An Adaptive Resource Partitioning Algorithm for SMT processors


Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454148

Huaping Wang, I. Koren, C. M. Krishna

Simultaneous Multithreading (SMT) increases processor throughput by allowing the parallel execution of several threads. However, fully sharing processor resources may cause resource monopolization by a single thread or other misallocations, resulting in overall performance degradation. Static resource partitioning techniques have been suggested, but are not as effective as dynamically controlling the resource usage of each thread since program behavior does change during its execution. In this paper, we propose an Adaptive Resource Partitioning Algorithm (ARPA) that dynamically assigns resources to threads according to thread behavior changes. ARPA analyzes the resource usage efficiency of each thread in a time period and assigns more resources to threads which can use them in a more efficient way. The purpose of ARPA is to improve the efficiency of resource utilization, thereby improving overall instruction throughput. Our simulation results on a set of 42 multiprogramming workloads show that ARPA outperforms the traditional fetch policy ICOUNT by 55.8% with regard to overall instruction throughput and achieves a 33.8% improvement over Static Partitioning. It also outperforms the current best dynamic resource allocation technique, Hill-climbing, by 5.7%. Considering fairness accorded to each thread, ARPA attains 43.6%, 18.5% and 9.2% improvements over ICOUNT, Static Partitioning and Hill-climbing, respectively, using a common fairness metric.
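
The epoch-based idea in the abstract (measure how efficiently each thread uses its resources, then shift resources toward the more efficient threads) might be sketched as follows. The abstract does not give ARPA's efficiency metric or its adjustment rule, so everything concrete here is an assumption: the function name `repartition`, efficiency defined as committed instructions per assigned entry, the fixed `step`, and the per-thread `floor`.

```python
def repartition(shares, committed, step=4, floor=8):
    """Return new per-thread resource shares after one epoch.

    shares    -- resource entries currently assigned to each thread (all > 0)
    committed -- instructions each thread committed during the epoch
    Efficiency is taken to be committed instructions per assigned entry,
    which is an assumed metric for this sketch.
    """
    efficiency = [c / s for c, s in zip(committed, shares)]
    best = max(range(len(shares)), key=efficiency.__getitem__)
    worst = min(range(len(shares)), key=efficiency.__getitem__)
    new_shares = list(shares)
    if best != worst:
        # Never push the donor thread below its guaranteed minimum.
        moved = min(step, new_shares[worst] - floor)
        if moved > 0:
            new_shares[worst] -= moved
            new_shares[best] += moved
    return new_shares


shares = [32, 32]
shares = repartition(shares, committed=[64, 16])   # thread 0 is more efficient
```

Moving a small fixed step per epoch, rather than jumping to a computed optimum, is one simple way to follow changes in program behavior; the floor keeps any single thread from being starved.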


Citations: 35

Analysis and approximation of optimal co-scheduling on Chip Multiprocessors


Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454146

Yunlian Jiang, Xipeng Shen, Jie Chen, Rahul Tripathi

Cache sharing among processors is important for Chip Multiprocessors to reduce inter-thread latency, but also brings cache contention, degrading program performance considerably. Recent studies have shown that job co-scheduling can effectively alleviate the contention, but it remains an open question how to efficiently find optimal co-schedules. Solving the question is critical for determining the potential of a co-scheduling system. This paper presents a theoretical analysis of the complexity of co-scheduling, proving its NP-completeness. Furthermore, for a special case when there are two sharers per chip, we propose an algorithm that finds the optimal co-schedules in polynomial time. For more complex cases, we design and evaluate a sequence of approximation algorithms, among which, the hierarchical matching algorithm produces near-optimal schedules and shows good scalability. This study facilitates the evaluation of co-scheduling systems, as well as offers some techniques directly usable in proactive job co-scheduling.
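
The two-sharers-per-chip case can be made concrete with a small sketch. The abstract does not describe the polynomial-time algorithm, so the code below is not that algorithm: it enumerates every way of pairing jobs, which is exponential and only practical for a handful of jobs. The function name `optimal_pairing`, the job labels, and the degradation values are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Pair = Tuple[str, str]


def optimal_pairing(jobs: List[str],
                    cost: Dict[FrozenSet[str], int]) -> Tuple[int, List[Pair]]:
    """Return (total degradation, pairs) for the cheapest pairing of jobs.

    cost maps an unordered pair of jobs to the degradation they suffer
    when they share a cache. Exhaustive search, for illustration only.
    """
    if len(jobs) % 2 != 0:
        raise ValueError("two sharers per chip requires an even number of jobs")
    if not jobs:
        return 0, []

    first, rest = jobs[0], jobs[1:]
    best_cost, best_pairs = None, []
    # Try every partner for the first job, then pair the remainder recursively.
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        sub_cost, sub_pairs = optimal_pairing(remaining, cost)
        total = cost[frozenset((first, partner))] + sub_cost
        if best_cost is None or total < best_cost:
            best_cost, best_pairs = total, [(first, partner)] + sub_pairs
    return best_cost, best_pairs


# Hypothetical pairwise degradations (in percent) for four jobs on two chips.
degradation = {
    frozenset(("A", "B")): 9, frozenset(("A", "C")): 2,
    frozenset(("A", "D")): 5, frozenset(("B", "C")): 6,
    frozenset(("B", "D")): 3, frozenset(("C", "D")): 8,
}
total, schedule = optimal_pairing(["A", "B", "C", "D"], degradation)
```

For these values the three possible schedules cost 17, 5, and 11, so the search selects A with C and B with D. Viewing jobs as graph vertices and pairwise degradations as edge weights turns this case into a minimum-weight perfect matching problem, which is one standard route to a polynomial-time solution; whether the paper takes that route is not stated in the abstract.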

Citations: 166

The PARSEC benchmark suite: Characterization and architectural implications

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)

Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454128

Christian Bienia, Sanjeev Kumar, J. Singh, Kai Li

This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

Citations: 3540
