
Latest publications: 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)

A software-SVM-based transactional memory for multicore accelerator architectures with local memory
Jun Lee, Sangmin Seo, Jaejin Lee
We propose a software transactional memory (STM) for heterogeneous multicores with small local memory. The heterogeneous multicore architecture consists of a general-purpose processor element (GPE) and multiple accelerator processor elements (APEs). The GPE is typically backed by a deep, on-chip cache hierarchy and hardware cache coherence. On the other hand, the APEs have small, explicitly addressed local memory that is not coherent with the main memory. Programmers of such multicore architectures suffer from explicit memory management and coherence problems. The STM for such multicores can alleviate the burden on the programmer and transparently handle data transfers at run time. Moreover, it frees the programmer from managing locks. Our TM is based on an existing software shared virtual memory (SVM) system for the accelerator architecture. The software SVM exploits software-managed caches and coherence protocols between the GPE and APEs. We also propose an optimization technique, called abort prediction, for the TM. It blocks a transaction from running until the chance of potential conflicts is eliminated. We implement the TM system and the optimization technique for a single Cell BE processor and evaluate their effectiveness with six compute-intensive benchmark applications.
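The abort-prediction idea can be illustrated with a toy sketch (hypothetical names and bookkeeping, not the paper's Cell BE implementation): before a transaction starts, its declared access set is compared against those of transactions already running, and it is held back on a predicted overlap instead of being allowed to run and abort later.

```python
class AbortPredictor:
    """Toy sketch of abort prediction: hold a transaction back while any
    running transaction touches an address it will also touch."""

    def __init__(self):
        self.running = {}  # transaction id -> set of addresses it accesses

    def try_start(self, txn_id, access_set):
        # Predict a conflict if any running transaction's access set overlaps.
        if any(s & set(access_set) for s in self.running.values()):
            return False  # caller waits and retries instead of aborting later
        self.running[txn_id] = set(access_set)
        return True

    def commit(self, txn_id):
        self.running.pop(txn_id, None)
```

Here `try_start` returning `False` models "blocking the transaction from running until the chance of potential conflicts is eliminated"; the caller retries once a conflicting transaction commits.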
DOI: 10.1145/1854273.1854355 · Published 2010-09-11
Citations: 1
AKULA: A toolset for experimenting and developing thread placement algorithms on multicore systems
Sergey Zhuravlev, S. Blagodurov, Alexandra Fedorova
Multicore processors have become commonplace in both desktops and servers. A serious challenge with multicore processors is that cores share on- and off-chip resources such as caches, memory buses, and memory controllers. Competition for these shared resources between threads running on different cores can result in severe and unpredictable performance degradations. It has been shown in previous work that the OS scheduler can be made shared-resource-aware and can greatly reduce the negative effects of resource contention. The search space of potential scheduling algorithms is huge considering the diversity of available multicore architectures, an almost infinite set of potential workloads, and a variety of conflicting performance goals. We believe the two biggest obstacles to developing new scheduling algorithms are the difficulty of implementation and the duration of testing. We address both of these challenges with our toolset AKULA, which we introduce in this paper. AKULA provides an API that allows developers to implement and debug scheduling algorithms easily and quickly without the need to modify the kernel or use system calls. AKULA also provides a rapid evaluation module, based on a novel evaluation technique also introduced in this paper, which allows the created scheduling algorithm to be tested on a wide variety of workloads in just a fraction of the time testing on real hardware would take. AKULA also facilitates running scheduling algorithms created with its API on real machines without the need for additional modifications. We use AKULA to develop and evaluate a variety of different contention-aware scheduling algorithms. We use the rapid evaluation module to test our algorithms on thousands of workloads and assess their scalability to futuristic massively multicore machines.
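As a flavor of the kind of policy such a toolset helps prototype (an illustrative contention-aware heuristic, not AKULA's actual API), a simple placement algorithm spreads the most cache-hungry threads across distinct shared caches so they do not compete for the same one:

```python
def contention_aware_assign(miss_rates, caches):
    """Illustrative contention-aware placement: rank threads by memory
    intensity and round-robin them over shared caches, so the heaviest
    threads land on different caches first.
    miss_rates: {thread_id: LLC misses per 1k instructions} (hypothetical metric)."""
    ranked = sorted(miss_rates, key=miss_rates.get, reverse=True)
    assignment = {}
    for i, tid in enumerate(ranked):
        assignment[tid] = caches[i % len(caches)]  # heavy threads separated
    return assignment
```

A rapid-evaluation module in the AKULA spirit would run such a policy against recorded workload profiles rather than live hardware.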
DOI: 10.1145/1854273.1854307 · Published 2010-09-11
Citations: 30
Towards a science of parallel programming
K. Pingali
How do we give parallel programming a more scientific foundation? In this talk, I will discuss the approach we are taking in the Galois project.
DOI: 10.1145/1854273.1854277 · Published 2010-09-11
Citations: 4
Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors
J. Gummaraju, L. Morichetti, Michael Houston, B. Sander, Benedict R. Gaster, Bixia Zheng
Modern processors are evolving into hybrid, heterogeneous processors with both CPU and GPU cores used for general-purpose computation. Several languages such as Brook, CUDA, and more recently OpenCL are being developed to fully harness the potential of these processors. These languages typically involve the control code running on the CPU and the performance-critical, data-parallel kernel code running on the GPUs. In this paper, we present Twin Peaks, a software platform for heterogeneous computing that executes code originally targeted for GPUs efficiently on CPUs as well. This permits a more balanced execution between the CPU and GPU, and enables portability of code between these architectures and to CPU-only environments. We propose several techniques in the runtime system to efficiently utilize the caches and functional units present in CPUs. Using OpenCL as a canonical language for heterogeneous computing, and running several experiments on real hardware, we show that our techniques enable GPGPU-style code to execute efficiently on multicore CPUs with minimal runtime overhead. These results also show that for maximum performance, it is beneficial for applications to utilize both CPUs and GPUs as accelerator targets. Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming. General Terms: Design, Experimentation, Performance. Keywords: GPGPU, Multicore, OpenCL, Programmability, Runtime.
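The central idea, executing a GPU-style NDRange kernel on a CPU by iterating over work-groups and the work-items within them, can be sketched as follows (illustrative Python standing in for the OpenCL runtime machinery; all names are assumptions):

```python
def run_kernel_on_cpu(kernel, global_size, local_size, *args):
    """Sketch of GPU-kernel-on-CPU execution: each work-group becomes an
    outer loop iteration, each work-item an inner loop iteration, so a
    data-parallel kernel runs unchanged on a CPU core."""
    for group in range(global_size // local_size):
        for local_id in range(local_size):
            # Reconstruct the global work-item id the kernel indexes by.
            kernel(group * local_size + local_id, *args)

def vadd(gid, a, b, out):
    """A GPGPU-style vector-add kernel, indexed by global work-item id."""
    out[gid] = a[gid] + b[gid]
```

A real runtime in this style would additionally pin work-groups to cores and block/vectorize work-items to use the CPU's caches and SIMD units, as the paper's techniques target.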
DOI: 10.1145/1854273.1854302 · Published 2010-09-11
Citations: 94
An empirical characterization of stream programs and its implications for language and compiler design
W. Thies, Saman P. Amarasinghe
Stream programs represent an important class of high-performance computations. Defined by their regular processing of sequences of data, stream programs appear most commonly in the context of audio, video, and digital signal processing, though also in networking, encryption, and other areas. In order to develop effective compilation techniques for the streaming domain, it is important to understand the common characteristics of these programs. Prior characterizations of stream programs have examined legacy implementations in C, C++, or FORTRAN, making it difficult to extract the high-level properties of the algorithms. In this work, we characterize a large set of stream programs that was implemented directly in a stream programming language, allowing new insights into the high-level structure and behavior of the applications. We utilize the StreamIt benchmark suite, consisting of 65 programs and 33,600 lines of code. We characterize the bottlenecks to parallelism, the data reference patterns, the input/output rates, and other properties. The lessons learned have implications for the design of future architectures, languages and compilers for the streaming domain.
DOI: 10.1145/1854273.1854319 · Published 2010-09-11
Citations: 175
Revisiting sorting for GPGPU stream architectures
D. Merrill, A. Grimshaw
This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting methods exhibit speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
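The structure of a scan-based LSD radix sort can be sketched sequentially (the paper's GPU version parallelizes the per-digit histogram and scatter across the stream processor; this shows only the algorithmic skeleton):

```python
def exclusive_scan(xs):
    """Exclusive prefix sum -- the scan primitive the sort is built on."""
    total, out = 0, []
    for x in xs:
        out.append(total)
        total += x
    return out

def radix_sort(keys, bits=32, radix_bits=4):
    """Sequential sketch of scan-based LSD radix sort over non-negative
    fixed-length keys: per digit, histogram -> exclusive scan -> stable scatter."""
    buckets = 1 << radix_bits
    for shift in range(0, bits, radix_bits):
        hist = [0] * buckets
        for k in keys:
            hist[(k >> shift) & (buckets - 1)] += 1
        offsets = exclusive_scan(hist)   # starting position of each bucket
        out = [0] * len(keys)
        for k in keys:                   # stable scatter by current digit
            d = (k >> shift) & (buckets - 1)
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

The generalized scan primitive the poster describes fuses the visiting logic (producer/consumer hooks) and the multiple concurrent prefix scans into this same histogram/scan/scatter structure.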
DOI: 10.1145/1854273.1854344 · Published 2010-09-11
Citations: 162
Exploiting subtrace-level parallelism in clustered processors
R. Ubal, J. Sahuquillo, S. Petit, P. López, J. Duato
The performance evaluation has been carried out on top of the Multi2Sim 2.2 simulation framework [2], a cycle-accurate simulator for x86-based superscalar processors, extended to model a clustered architecture with support for independent subtrace generation. The parameters of the modeled machine are summarized in Table 1. The Mediabench suite has been used to stress the machine, and simulations are stopped after the first 100 million uops commit. The steering algorithm and the interconnection network among clusters are important design factors related to the criticality of the inter-cluster communication latency. For a good baseline performance, the modeled schemes use a sophisticated steering algorithm called topology-aware steering [3], and several interconnection networks with different realistic link delays are considered.
DOI: 10.1145/1854273.1854349 · Published 2010-09-11
Citations: 0
SWEL: Hardware cache coherence protocols to map shared data onto shared caches
Seth H. Pugsley, J. Spjut, D. Nellans, R. Balasubramonian
Snooping and directory-based coherence protocols have become the de facto standard in chip multi-processors, but neither design is without drawbacks. Snooping protocols are not scalable, while directory protocols incur directory storage overhead, frequent indirections, and are more prone to design bugs. In this paper, we propose a novel coherence protocol that greatly reduces the number of coherence operations and falls back on a simple broadcast-based snooping protocol when infrequent coherence is required. This new protocol is based on the premise that most blocks are either private to a core or read-only, and hence, do not require coherence. This will be especially true for future large-scale multi-core machines that will be used to execute message-passing workloads in the HPC domain, or multiple virtual machines for servers. In such systems, it is expected that a very small fraction of blocks will be both shared and frequently written, hence the need to optimize coherence protocols for a new common case. In our new protocol, dubbed SWEL (protocol states are Shared, Written, Exclusivity Level), the L1 cache attempts to store only private or read-only blocks, while shared and written blocks must reside at the shared L2 level. These determinations are made at runtime without software assistance. While accesses to blocks banished from the L1 become more expensive, SWEL can improve throughput because directory indirection is removed for many common write-sharing patterns. Compared to a MESI based directory implementation, we see up to 15% increased performance, a maximum degradation of 2%, and an average performance increase of 2.5% using SWEL and its derivatives. Other advantages of this strategy are reduced protocol complexity (achieved by reducing transient states) and significantly less storage overhead than traditional directory protocols.
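The classification SWEL relies on can be sketched as follows (simplified, hypothetical software bookkeeping, not the proposed hardware): a block stays L1-eligible while it is private to one core or read-only, and must live only at the shared L2 once it is both shared and written.

```python
class BlockClassifier:
    """Sketch of SWEL-style runtime classification: track which cores read
    and write each block, and derive whether it may live in an L1."""

    def __init__(self):
        self.readers, self.writers = {}, {}  # block -> set of cores

    def access(self, block, core, is_write):
        table = self.writers if is_write else self.readers
        table.setdefault(block, set()).add(core)

    def l1_eligible(self, block):
        sharers = self.readers.get(block, set()) | self.writers.get(block, set())
        written = bool(self.writers.get(block))
        # Private (at most one core) or read-only blocks need no coherence
        # in the L1; shared-and-written blocks go to the shared L2 only.
        return len(sharers) <= 1 or not written
```

The payoff in the paper is that only the small shared-and-written fraction ever triggers the fallback broadcast snoop, so directory indirection disappears for the common case.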
DOI: 10.1145/1854273.1854331 · Published 2010-09-11
Citations: 59
An intra-tile cache set balancing scheme
Mohammad Hammoud, Sangyeun Cho, R. Melhem
This poster describes an intra-tile cache set balancing strategy that exploits the demand imbalance across sets within the same L2 cache bank. This strategy retains some fraction of the working set at underutilized sets so as to satisfy far-flung reuses. It adapts to phase changes in programs and promotes a very flexible sharing among cache sets referred to as many-from-many sharing. Simulation results using a full system simulator demonstrate the effectiveness of the proposed scheme and show that it compares favorably with related cache designs on a 16-way tiled CMP platform.
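A toy sketch of the balancing decision (illustrative only, not the poster's actual mechanism): on eviction from a pressured set, retain the victim line in the currently least-pressured set within the same bank, if a genuinely underutilized one exists.

```python
def choose_host_set(pressures, victim_set):
    """Pick a host set for a line evicted from `victim_set`.
    pressures: per-set demand estimates (hypothetical metric, e.g. recent
    misses). Returns the least-pressured set, or None if no set is less
    pressured than the victim's own set."""
    host = min(range(len(pressures)), key=pressures.__getitem__)
    return host if pressures[host] < pressures[victim_set] else None
```

Retaining the line this way is what lets "far-flung reuses" still hit in the bank instead of going off-tile.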
DOI: 10.1145/1854273.1854346 · Published 2010-09-11
Citations: 4
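The retention policy in the abstract above (keeping victims from pressured sets in underutilized host sets so that far-flung reuses still hit) can be sketched as a toy cache model. `BalancedCache`, its probe-all lookup, and the least-occupied host policy are illustrative assumptions, not the paper's actual design:

```python
from collections import OrderedDict

class BalancedCache:
    """Toy model of intra-tile set balancing: a victim evicted from a
    pressured set is retained in an underutilized host set, and lookups
    probe every set so retained lines can still be found."""

    def __init__(self, num_sets=8, ways=2):
        self.ways = ways
        # Each set is an LRU-ordered mapping of resident tags.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def _home(self, addr):
        return addr % len(self.sets)

    def _host_for(self, home):
        # Stand-in for a saturation counter: pick the least-occupied
        # set other than the home set.
        return min((i for i in range(len(self.sets)) if i != home),
                   key=lambda i: len(self.sets[i]))

    def access(self, addr):
        home = self._home(addr)
        # Probe all sets: a retained victim may live in any host set
        # (many-from-many sharing).
        for s in self.sets:
            if addr in s:
                s.move_to_end(addr)  # refresh LRU position
                return True          # hit, possibly a far-flung reuse
        self.misses += 1
        if len(self.sets[home]) >= self.ways:
            # Evict the LRU line, but retain it in an underutilized
            # host set instead of dropping it.
            victim, _ = self.sets[home].popitem(last=False)
            host = self._host_for(home)
            if len(self.sets[host]) < self.ways:
                self.sets[host][victim] = True
        self.sets[home][addr] = True
        return False
```

With two 1-way sets, two addresses that conflict in set 0 no longer evict each other: the displaced line is retained in set 1, and the later reuse hits there instead of missing.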
Scalable hardware support for conditional parallelization
Zheng Li, Olivier Certner, J. Duato, O. Temam
Parallel programming approaches based on task division/spawning are getting increasingly popular because they provide a simple and elegant abstraction of parallelization, while achieving good performance on workloads which are traditionally hard to parallelize due to the complex control flow and data structures involved. The ability to quickly distribute fine-granularity tasks among many cores is key to the efficiency and scalability of such division-based parallel programming approaches. For this reason, several hardware mechanisms supporting work-stealing environments have already been proposed. However, they all rely on a central hardware structure for distributing tasks among cores, which hampers the scalability and efficiency of these schemes. In this paper, we focus on conditional division, a division-based parallel approach which provides the additional benefit, over work-stealing approaches, of relieving the user from dealing with task granularity, and which does not clog hardware resources with an exceedingly large number of small tasks.
DOI: 10.1145/1854273.1854297
Citations: 6
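The conditional-division policy described above, dividing only when an idle core can take the work and otherwise running the subproblem inline, can be illustrated with a toy sequential model. `ConditionalDivider` and its free-slot counter are hypothetical stand-ins for the local idleness information a real runtime or hardware would consult, not the paper's mechanism:

```python
class ConditionalDivider:
    """Toy model of conditional division for a divide-and-conquer
    computation: a subproblem becomes a separate task only when a
    worker slot is free; otherwise it runs inline. The task count is
    therefore bounded by load, not by the size of the recursion tree."""

    def __init__(self, workers):
        self.free_slots = workers   # stand-in for "is any core idle?"
        self.tasks_spawned = 0

    def fib(self, n):
        if n < 2:
            return n
        if self.free_slots > 0:
            # Divide: hand the left subproblem to an idle worker.
            # (In a real runtime this would run on another core.)
            self.free_slots -= 1
            self.tasks_spawned += 1
            left = self.fib(n - 1)
            self.free_slots += 1    # the worker becomes idle again
        else:
            # No idle worker: run inline, creating no task at all.
            left = self.fib(n - 1)
        return left + self.fib(n - 2)
```

With zero workers nothing is ever divided; with more workers than the recursion depth, every internal call divides; in between, the spawn count adapts to load, which is the point of conditional division.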
Journal: 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)