
2011 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

Programming Strategies for GPUs and their Power Consumption
Sayan Ghosh, B. Chapman
GPUs are slowly becoming ubiquitous devices in high performance computing. Nvidia's newly released version 4.0 of the CUDA API [2] for GPU programming offers multiple ways to program GPUs and emphasizes multi-GPU environments, which are common in modern compute clusters. However, despite the steady progress in FLOP counts, the bane of large-scale computing systems has been increased energy consumption and cooling costs. Since the energy (power × time) of a system correlates directly with the user program, different GPU programming techniques may affect overall system energy consumption.
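To make the power × time relation concrete, here is a minimal C++ harness (mine, not the paper's) that estimates the energy of a GPU workload by integrating NVML power samples over its runtime; run_workload is a hypothetical stand-in for whichever CUDA programming strategy is being compared.

```cpp
// Hypothetical energy-measurement sketch: E = sum of (power sample x interval).
#include <nvml.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> done{false};

// Stand-in for one of the GPU programming strategies under comparison.
void run_workload() { /* launch kernels, synchronize, ... */ done = true; }

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    std::thread worker(run_workload);
    double joules = 0.0;
    auto prev = std::chrono::steady_clock::now();
    while (!done) {
        unsigned int mw = 0;
        nvmlDeviceGetPowerUsage(dev, &mw);   // current board power, milliwatts
        auto now = std::chrono::steady_clock::now();
        joules += (mw / 1000.0) *
                  std::chrono::duration<double>(now - prev).count();
        prev = now;
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    worker.join();
    std::printf("estimated energy: %.1f J\n", joules);  // energy = power x time
    nvmlShutdown();
    return 0;
}
```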
Citations: 1
SymptomTM: Symptom-Based Error Detection and Recovery Using Hardware Transactional Memory
Gulay Yalcin, O. Unsal, A. Cristal, I. Hur, M. Valero
Fault tolerance has become an essential concern for processor designers due to increasing transient and permanent fault rates. In this study we propose SymptomTM, a symptom-based error detection technique that recovers from errors by leveraging the abort mechanism of Transactional Memory (TM). To the best of our knowledge, this is the first architectural fault-tolerance proposal using Hardware Transactional Memory (HTM). SymptomTM can recover from 86% and 65% of catastrophic failures caused by transient and permanent errors, respectively, with no performance overhead in error-free executions.
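The paper evaluates a simulated HTM; as a rough illustration of the abort-based recovery idea only, the sketch below uses Intel TSX (RTM) intrinsics as a stand-in, with symptom_detected and critical_region as hypothetical placeholders.

```cpp
#include <immintrin.h>  // compile with -mrtm on an RTM-capable x86 CPU

static bool symptom_detected() { return false; }  // placeholder fault detector
static void critical_region() { /* updates that must stay recoverable */ }

// Execute critical_region transactionally; a detected symptom aborts the
// transaction, so hardware discards all speculative state before the retry.
void run_with_recovery() {
    for (int tries = 0; tries < 8; ++tries) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            critical_region();
            if (symptom_detected())
                _xabort(0xFF);   // symptom: roll back to the clean pre-tx state
            _xend();             // no symptom observed: commit
            return;
        }
        // Aborted (symptom, conflict, or capacity): state was rolled back.
    }
    critical_region();           // fallback after repeated aborts (unprotected)
}
```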
Citations: 12
POPS: Coherence Protocol Optimization for Both Private and Shared Data
Hemayet Hossain, S. Dwarkadas, Michael C. Huang
As the number of cores in a chip multiprocessor (CMP) increases, the need for larger on-chip caches also increases in order to avoid creating a bottleneck at the off-chip interconnect. These CMPs run combinations of multithreaded and multiprogrammed workloads, exhibiting a range of sharing behavior, from frequent inter-thread communication to no communication. The goal of CMP cache design is to maximize capacity for a given size while providing as low a latency as possible across the entire range of sharing behavior. In a typical CMP design, the last-level cache (LLC) is shared across the cores and incurs an access latency that is a function of distance on the chip. Sharing helps avoid the need for replicas in the LLC and allows any core to access the entire on-chip cache space. However, the cost is increased communication latency depending on where data is mapped on the chip. In this paper, we propose a cache coherence design we call POPS that provides localized data and metadata access for both shared data (in multithreaded workloads) and private data (predominant in multiprogrammed workloads). POPS achieves its goal by (1) decoupling data and metadata, allowing both to be delegated to local LLC slices for private data and between sharers for shared data, (2) freeing delegated data storage in the LLC for larger effective capacity, and (3) changing the delegation and/or coherence protocol action based on the observed sharing pattern. Our analysis on an execution-driven full-system simulator using multithreaded and multiprogrammed workloads shows that, compared to the base non-uniform shared L2 protocol, POPS performs 42% better (28% without microbenchmarks) for multithreaded workloads, 16% better for multiprogrammed workloads, and 8% better when a single-threaded application is the only running process. POPS has the added benefits of reduced on-chip and off-chip traffic and reduced dynamic energy consumption.
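As a loose software analogy (a toy model of mine, not the paper's hardware design), the sketch below shows a directory entry that tracks sharer metadata separately from data placement, delegating a block toward its single user while it looks private and reverting to home-node handling once it becomes shared.

```cpp
#include <bitset>

constexpr int kCores = 16;

struct DirEntry {
    std::bitset<kCores> sharers;  // metadata: cores that may cache the block
    int delegate = -1;            // core holding delegated data, -1 = home node
};

// On a miss by `core`, adapt delegation to the observed sharing pattern.
void on_miss(DirEntry& e, int core) {
    if (e.sharers.none() || (e.sharers.count() == 1 && e.sharers.test(core)))
        e.delegate = core;        // private so far: keep data near its only user
    else
        e.delegate = -1;          // shared: fall back to home-node coherence
    e.sharers.set(core);
}
```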
Citations: 44
Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU
Ziyu Guo, E. Zhang, Xipeng Shen
Automatic compilation for multiple types of devices is important, especially given the current trend towards heterogeneous computing. This paper concentrates on issues in compiling fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPUs. It points out some correctness pitfalls in existing techniques, particularly in their treatment of implicit synchronizations. It then describes a systematic dependence analysis specially designed to handle implicit synchronizations in SPMD-threaded programs. By unveiling the relations between inter-thread data dependences and the correct treatment of synchronizations, it presents a dependence-based solution to the problem. Experiments demonstrate that the proposed techniques can resolve the correctness issues in existing compilation techniques and help compilers produce correct and efficient translation results.
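For the explicit-barrier case, the standard serialization strategy wraps kernel code in a "thread loop" and splits that loop at each synchronization point; the paper's concern is that implicit synchronizations need the same treatment, which a dependence analysis must first discover. A minimal sketch with illustrative names (N, tmp, in, out are not from the paper):

```cpp
#include <vector>

// CPU translation of an SPMD kernel whose source contained one barrier.
void kernel_on_cpu(const std::vector<float>& in, std::vector<float>& out) {
    const int N = static_cast<int>(in.size());
    std::vector<float> tmp(N);            // per-thread value promoted to an array

    for (int tid = 0; tid < N; ++tid)     // phase 1: code before the barrier
        tmp[tid] = in[tid] * 2.0f;

    // __syncthreads() becomes the boundary between the two fissioned loops:
    // every logical thread finishes phase 1 before any enters phase 2.

    for (int tid = 0; tid < N; ++tid)     // phase 2: code after the barrier
        out[tid] = tmp[(tid + 1) % N];    // safe cross-thread read
}
```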
Citations: 15
Probabilistic Models Towards Optimal Speculation of DFA Applications
Zhijia Zhao, Bo Wu
Applications based on Deterministic Finite Automata (DFA) are important for many tasks, including lexing in web browsers, routing in networks, and decoding in cryptography. The efficiency of these applications is often critical, but parallelizing them is difficult due to strong dependences among states. Recent years have seen some use of speculative execution to address this problem. Even though some promising results have been shown, existing designs are all static, lacking the capability to adapt to specific DFA applications and inputs in order to maximize the speculation benefits. In this work, we initiate an exploration of the inherent relations between the design of speculation schemes and the properties of DFAs and their inputs. After revealing some theoretical findings about these relations, we develop a model-based approach to maximizing the performance of speculatively executed DFA-based applications. Experiments demonstrate that the developed techniques can accelerate speculative executions by integer factors compared to state-of-the-art techniques.
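A minimal sketch of the underlying speculation scheme (the tiny transition table, the chunking, and the naive always-guess-state-0 predictor are placeholders; the paper's probabilistic models choose predictions far more carefully): chunks run in parallel from a guessed start state, and a chunk is re-run only when its guess turns out wrong.

```cpp
#include <algorithm>
#include <array>
#include <future>
#include <string>
#include <vector>

constexpr int kStates = 4;                       // toy automaton size
using Table = std::array<std::array<int, 256>, kStates>;

// Run the DFA over s[lo, hi) starting from state st; return the final state.
int run_chunk(const Table& t, const std::string& s,
              size_t lo, size_t hi, int st) {
    for (size_t i = lo; i < hi; ++i)
        st = t[st][static_cast<unsigned char>(s[i])];
    return st;
}

int speculative_final_state(const Table& t, const std::string& s, int chunks) {
    const size_t step = s.size() / chunks + 1;
    std::vector<std::future<int>> results;
    for (int c = 0; c < chunks; ++c) {           // launch all chunks in parallel
        size_t lo = c * step, hi = std::min(s.size(), lo + step);
        int guess = 0;                           // naive start-state prediction
        results.push_back(std::async(std::launch::async, run_chunk,
                                     std::cref(t), std::cref(s), lo, hi, guess));
    }
    int state = 0;                               // true initial state
    for (int c = 0; c < chunks; ++c) {           // validate left to right
        size_t lo = c * step, hi = std::min(s.size(), lo + step);
        int speculated = results[c].get();
        state = (state == 0)                     // incoming state matches guess?
                    ? speculated                 // yes: speculation succeeded
                    : run_chunk(t, s, lo, hi, state);  // no: re-run this chunk
    }
    return state;
}
```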
Citations: 2
A Unified Scheduler for Recursive and Task Dataflow Parallelism
H. Vandierendonck, George Tzenakis, Dimitrios S. Nikolopoulos
Task dataflow languages simplify the specification of parallel programs by dynamically detecting and enforcing dependencies between tasks. These languages are, however, often restricted to a single level of parallelism. This language design is reflected in the runtime system, where a master thread explicitly generates a task graph and worker threads execute ready tasks and wake up their dependents. Such an approach is incompatible with state-of-the-art schedulers such as the Cilk scheduler, which minimize the creation of idle tasks (the work-first principle) and place all task creation and scheduling off the critical path. This paper proposes an extension to the Cilk scheduler that reconciles task dependencies with the work-first principle. We discuss the impact of task dependencies on the properties of the Cilk scheduler. Furthermore, we propose a low-overhead ticket-based technique for dependency tracking and enforcement at the object level. Our scheduler also supports renaming of objects in order to increase task-level parallelism. Renaming is implemented using versioned objects, a new type of hyperobject. Experimental evaluation shows that the unified scheduler is as efficient as the Cilk scheduler when tasks have no dependencies. Moreover, the unified scheduler is more efficient than SMPSs, a particular implementation of a task dataflow language.
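As an illustration of the ticket idea (a simplified model of mine, not the paper's implementation; reader/writer distinctions and object renaming are omitted), each task takes a ticket on every object it will touch and becomes ready when the object's completion counter reaches its ticket, ordering tasks without building an explicit task graph.

```cpp
#include <atomic>

struct VersionedObject {
    std::atomic<unsigned> next_ticket{0};   // taken when a task is spawned
    std::atomic<unsigned> now_serving{0};   // advanced when a task completes
};

// Called at spawn time: cheap, keeps task creation off the critical path.
unsigned register_access(VersionedObject& o) {
    return o.next_ticket.fetch_add(1);
}

// A task is ready once every object it touches is serving its ticket.
bool ready(const VersionedObject& o, unsigned ticket) {
    return o.now_serving.load(std::memory_order_acquire) == ticket;
}

// Called at task completion: wakes the next dependent in ticket order.
void complete(VersionedObject& o) {
    o.now_serving.fetch_add(1, std::memory_order_release);
}
```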
Citations: 41
DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory
Carlos Villavieja, Vasileios Karakostas, L. Vilanova, Yoav Etsion, Alex Ramírez, A. Mendelson, N. Navarro, A. Cristal, O. Unsal
Translation Lookaside Buffers (TLBs) are used ubiquitously in modern architectures to cache virtual-to-physical mappings and, as they are looked up on every memory access, are paramount to performance scalability. The emergence of chip multiprocessors (CMPs) with per-core TLBs has brought the problem of TLB coherence to the front stage. TLBs are kept coherent at the software level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and synchronize TLBs using costly Inter-Processor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherency mechanism. First, we show that both TLB shootdown cost and frequency increase with the number of processors, and project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, thereby eliminating the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
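A toy software model of the directory idea (the actual proposal is a hardware structure coupled with load/store queue support): record which cores cache each translation so a shootdown targets exactly those cores instead of broadcasting IPIs to all of them. invalidate_core_tlb is a stub standing in for the lightweight invalidation path.

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>

constexpr int kCores = 64;

struct TlbDirectory {
    // Virtual page number -> set of cores caching its translation.
    std::unordered_map<uint64_t, std::bitset<kCores>> entries;

    void record_fill(uint64_t vpn, int core) { entries[vpn].set(core); }

    // On a permission change, invalidate only the cores holding the entry.
    void shootdown(uint64_t vpn) {
        auto it = entries.find(vpn);
        if (it == entries.end()) return;        // no core caches it: free
        for (int c = 0; c < kCores; ++c)
            if (it->second.test(c))
                invalidate_core_tlb(c, vpn);    // targeted, no broadcast IPI
        entries.erase(it);
    }

    static void invalidate_core_tlb(int /*core*/, uint64_t /*vpn*/) {
        /* model stub for the hardware invalidation message */
    }
};
```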
Citations: 101
Large Scale Verification of MPI Programs Using Lamport Clocks with Lazy Update
Anh Vo, G. Gopalakrishnan, R. Kirby, B. Supinski, M. Schulz, G. Bronevetsky
We propose a dynamic verification approach for large-scale message passing programs to locate correctness bugs caused by unforeseen nondeterministic interactions. This approach hinges on an efficient protocol to track the causality between nondeterministic message receive operations and potentially matching send operations. We show that causality tracking protocols that rely solely on logical clocks fail to capture all nuances of MPI program behavior, including the variety of ways in which nonblocking calls can complete. Our approach hinges on formally defining the matches-before relation underlying the MPI standard, and devising lazy-update logical-clock algorithms that can correctly discover all potential outcomes of nondeterministic receives in practice. Our Lazy Lamport Clocks Protocol (LLCP) can achieve the same coverage as a vector-clock-based algorithm while maintaining good scalability. LLCP allows us to analyze realistic MPI programs involving a thousand MPI processes, incurring only modest overheads in terms of communication bandwidth, latency, and memory consumption.
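The base mechanism can be sketched in a few lines of MPI: piggyback the sender's Lamport clock on each message and apply the max-plus-one rule on receipt. The lazy-update refinement and the matches-before bookkeeping for nondeterministic (MPI_ANY_SOURCE) receives are what the paper adds and are not shown here.

```cpp
#include <mpi.h>
#include <algorithm>

static long clk = 0;  // this process's Lamport clock

// Send an int payload with the sender's clock piggybacked on it.
void clocked_send(int value, int dest) {
    long msg[2] = {++clk, static_cast<long>(value)};
    MPI_Send(msg, 2, MPI_LONG, dest, 0, MPI_COMM_WORLD);
}

// Receive a payload and advance the local clock: max(local, sender) + 1.
int clocked_recv(int src) {
    long msg[2];
    MPI_Recv(msg, 2, MPI_LONG, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    clk = std::max(clk, msg[0]) + 1;   // Lamport update rule
    return static_cast<int>(msg[1]);
}
```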
Citations: 24
Improving Run-Time Scheduling for General-Purpose Parallel Code
Alexandros Tzannes, R. Barua, U. Vishkin
Today, almost all desktop and laptop computers are shared-memory multicores, but the code they run is overwhelmingly serial. High-level language extensions and libraries (e.g., OpenMP, Cilk++, TBB) make it much easier for programmers to write parallel code than previous approaches (e.g., MPI), in large part thanks to the efficient work-stealing scheduler that allows the programmer to expose more parallelism than the actual hardware parallelism. But when the parallel tasks are too short or too many, the scheduling overheads become significant and hurt performance. Because this happens frequently (e.g., with data parallelism and PRAM algorithms), programmers need to manually coarsen tasks for performance by combining many of them into longer tasks. But manual coarsening typically causes overfitting of the code to the input data, platform, and context used to do the coarsening, and harms performance portability. We propose distinguishing between two types of coarsening and using different techniques for each. We then improve on our previous work on Lazy Binary Splitting (LBS), a scheduler that performs the second type of coarsening dynamically but fails to scale on large commercial multicores. Our improved scheduler, Breadth-First Lazy Scheduling (BF-LS), overcomes the scalability issue of LBS and performs much better on large machines.
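A sketch of the lazy-splitting idea that LBS-style schedulers automate (the deque operations are model stubs of mine): split a range only when the local deque looks empty, i.e., when a thief may actually need the work, and otherwise run a profitably-sized chunk serially before re-checking. This replaces manual, input-specific coarsening with a dynamic decision.

```cpp
#include <algorithm>
#include <cstddef>

static bool local_queue_empty() { return false; }       // stub: steal-pressure hint
static void spawn_range(std::size_t, std::size_t) {}    // stub: push half for thieves
static void body(std::size_t) {}                        // stub: the loop body

void lazy_parallel_for(std::size_t lo, std::size_t hi,
                       std::size_t ppt /* profitable parallelism threshold */) {
    while (lo < hi) {
        if (hi - lo > ppt && local_queue_empty()) {
            std::size_t mid = lo + (hi - lo) / 2;
            spawn_range(mid, hi);                       // expose half on demand
            hi = mid;
            continue;
        }
        std::size_t end = lo + std::min(ppt, hi - lo);  // serial, overhead-free
        for (std::size_t i = lo; i < end; ++i) body(i);
        lo = end;                                       // then re-check the deque
    }
}
```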
Citations: 2
An OpenCL Framework for Homogeneous Manycores with No Hardware Cache Coherence
Jun Lee, Jungwon Kim, Junghyun Kim, Sangmin Seo, Jaejin Lee
Recently, Intel has introduced a research prototype manycore processor called the Single-chip Cloud Computer (SCC). The SCC is an experimental processor created by Intel Labs. It contains 48 cores in a single chip, and each core has its own L1 and L2 caches without any hardware support for cache coherence. It supports up to 64GB of external memory that can be accessed by all cores, and each core dynamically maps the external memory into its own address space. In this paper, we introduce the design and implementation of an OpenCL framework (i.e., runtime and compiler) for such manycore architectures with no hardware cache coherence. We have found that the OpenCL coherence and consistency model fits the SCC architecture well. OpenCL's weak memory consistency model requires a relatively small number of messages and coherence actions to guarantee coherence and consistency between the memory blocks in the SCC. The dynamic memory mapping mechanism enables our framework to preserve the semantics of buffer object operations in OpenCL with a small overhead. We have implemented the proposed OpenCL runtime and compiler and evaluated their performance on the SCC with OpenCL applications.
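The fit between OpenCL's consistency model and a non-coherent machine is visible in ordinary host code: the runtime only has to make a buffer coherent at command boundaries such as the map below, never on individual loads and stores. A minimal sketch (error handling omitted; the kernel object and size are placeholders, and the flush/invalidate comment describes what a software-coherence runtime like the paper's would do at that point):

```cpp
#include <CL/cl.h>

// Launch a kernel on a buffer, then read the result back on the host.
void roundtrip(cl_context ctx, cl_command_queue q, cl_kernel k, size_t n) {
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float),
                                nullptr, nullptr);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);

    // The blocking map is a synchronization point: on a non-coherent machine
    // such as the SCC, the runtime can flush/invalidate here, and only here.
    float* host = static_cast<float*>(
        clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ, 0,
                           n * sizeof(float), 0, nullptr, nullptr, nullptr));
    // ... inspect host[0..n) ...
    clEnqueueUnmapMemObject(q, buf, host, 0, nullptr, nullptr);
    clReleaseMemObject(buf);
}
```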
Citations: 14