
2011 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

Using a Reconfigurable L1 Data Cache for Efficient Version Management in Hardware Transactional Memory
Adrià Armejach, A. Seyedi, Rubén Titos-Gil, I. Hur, Adrián Cristal, O. Unsal, M. Valero
Transactional Memory (TM) potentially simplifies parallel programming by providing atomicity and isolation for executed transactions. One of the key mechanisms to provide such properties is version management, which defines where and how transactional updates (new values) are stored. Version management can be implemented either eagerly or lazily. In Hardware Transactional Memory (HTM) implementations, eager version management puts new values in place and keeps old values in a software log, while lazy version management stores new values in hardware buffers and keeps old values in place. Current HTM implementations, for both eager and lazy version management schemes, suffer performance penalties because they cannot handle two versions of the same logical data efficiently. In this paper, we introduce a reconfigurable L1 data cache architecture that has two execution modes: a 64KB general-purpose mode and a 32KB TM mode that is able to manage two versions of the same logical data. The latter makes it possible to handle old and new transactional values within the cache simultaneously when executing transactional workloads. We explain in detail the architectural design and internals of this Reconfigurable Data Cache (RDC), as well as the supported operations that allow existing version management problems to be solved efficiently. We describe how the RDC can support both eager and lazy HTM systems, and we present two RDC-HTM designs. Our evaluation shows that the Eager-RDC-HTM and Lazy-RDC-HTM systems achieve 1.36x and 1.18x speedup, respectively, over state-of-the-art proposals. We also evaluate the area and energy effects of our proposal, and we find that RDC designs are 1.92x and 1.38x more energy-delay efficient than baseline HTM systems, with less than 0.3% area impact on modern processors.
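To make the eager/lazy distinction concrete, here is a minimal C++ sketch of the two version-management policies the abstract contrasts. All names (UndoLog, WriteBuffer, the tx_* functions) are illustrative assumptions, not the paper's API; in a real HTM these actions happen in the cache hardware, not in software data structures.

```cpp
// Sketch only: eager vs. lazy version management modeled in software.
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct UndoLog {                      // eager: old values go to a log
    std::vector<std::pair<uintptr_t*, uintptr_t>> entries;
};

// Eager: new value written in place, old value saved for rollback.
void tx_write_eager(UndoLog& log, uintptr_t* addr, uintptr_t val) {
    log.entries.push_back({addr, *addr});   // save old value
    *addr = val;                            // update in place
}
void tx_abort_eager(UndoLog& log) {         // abort must restore old values
    for (auto it = log.entries.rbegin(); it != log.entries.rend(); ++it)
        *it->first = it->second;
    log.entries.clear();
}
// Commit is trivial in eager mode: simply discard the log.

struct WriteBuffer {                  // lazy: new values buffered aside
    std::unordered_map<uintptr_t*, uintptr_t> pending;
};

// Lazy: old value stays in place, new value buffered until commit.
void tx_write_lazy(WriteBuffer& wb, uintptr_t* addr, uintptr_t val) {
    wb.pending[addr] = val;
}
uintptr_t tx_read_lazy(WriteBuffer& wb, uintptr_t* addr) {
    auto it = wb.pending.find(addr);        // reads must see own writes
    return it != wb.pending.end() ? it->second : *addr;
}
void tx_commit_lazy(WriteBuffer& wb) {      // commit publishes new values
    for (auto& [addr, val] : wb.pending) *addr = val;
    wb.pending.clear();
}
// Abort is trivial in lazy mode: simply discard the buffer.
```

Either way, one of the four transactional events (commit or abort) must touch every written location, which is exactly the cost the RDC's two-version cache mode is designed to avoid.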
Citations: 16
A Software-Managed Coherent Memory Architecture for Manycores
Jungho Park, Choonki Jang, Jaejin Lee
Cache-coherent Non-Uniform Memory Access (cc-NUMA) architectures have been widely used for chip multiprocessors (CMPs). However, they require complicated hardware to properly handle the cache coherence problem. Moreover, coherence enforcement generates heavy on-chip network traffic. In this work, we propose a simple software-managed coherent memory architecture for manycores. Our memory architecture exploits explicitly addressed local stores. Instead of implementing a complicated cache coherence protocol in hardware, coherence and consistency are supported by software, such as a runtime or an operating system. The local stores together with the software leverage conventional caches to make the architecture much simpler and to generate much less network traffic than conventional cc-NUMA-based CMPs. Experimental results indicate that our approach is promising.
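As an illustration of the programming model the abstract implies, the following C++ sketch shows coherence managed purely in software over an explicitly addressed local store: data is copied in, computed on privately, and written back only at explicit synchronization points. The ls_get/ls_put runtime calls and the LocalStore layout are assumptions for illustration; the paper's actual runtime/OS interface may differ.

```cpp
// Sketch: software-managed coherence over an explicit local store.
#include <cstddef>
#include <cstring>

constexpr size_t LS_SIZE = 256 * 1024;   // per-core local store (assumed size)

struct LocalStore {
    unsigned char buf[LS_SIZE];
};

// Copy a shared-memory region into the local store ("acquire" side):
// software, not hardware, decides when a fresh copy is needed.
void ls_get(LocalStore& ls, size_t ls_off, const void* shared, size_t n) {
    std::memcpy(ls.buf + ls_off, shared, n);
}

// Write a local-store region back to shared memory ("release" side):
// coherence is enforced only at these explicit points, so no coherence
// traffic is generated in between.
void ls_put(LocalStore& ls, size_t ls_off, void* shared, size_t n) {
    std::memcpy(shared, ls.buf + ls_off, n);
}

// Typical use: get -> compute privately -> put at a synchronization point.
void scale_region(LocalStore& ls, double* shared, size_t count) {
    ls_get(ls, 0, shared, count * sizeof(double));
    double* p = reinterpret_cast<double*>(ls.buf);
    for (size_t i = 0; i < count; ++i) p[i] *= 2.0;   // private computation
    ls_put(ls, 0, shared, count * sizeof(double));
}
```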
Citations: 4
Performance Per Watt Benefits of Dynamic Core Morphing in Asymmetric Multicores
Rance Rodrigues, A. Annamalai, I. Koren, S. Kundu, O. Khan
The trend toward multicore processors is moving the emphasis in computation from sequential to parallel processing. However, not all applications can be parallelized and benefit from multiple cores. Such applications lead to under-utilization of parallel resources and hence sub-optimal performance/watt. They may, however, benefit from powerful uniprocessors. On the other hand, not all applications can take advantage of more powerful uniprocessors. To address the competing requirements of diverse applications, we propose a heterogeneous multicore architecture with a Dynamic Core Morphing (DCM) capability. Depending on the computational demands of the currently executing applications, the resources of a few tightly coupled cores are morphed at runtime. We present a simple hardware-based algorithm to monitor the time-varying computational needs of the application and, when deemed beneficial, trigger reconfiguration of the cores at fine-grain time scales to maximize the performance/watt of the application. The proposed dynamic scheme is then compared against a baseline static heterogeneous multicore configuration and an equivalent homogeneous configuration. Our results show that dynamic morphing of cores can provide average performance/watt gains of 43% and 16% compared to the homogeneous and baseline heterogeneous configurations, respectively.
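A hedged C++ sketch of the kind of counter-driven policy such a hardware monitor could implement: sample each core over a fine-grain interval and morph a coupled core pair when one application would clearly profit from the combined resources. The counters and thresholds here are invented for illustration and are not the paper's algorithm.

```cpp
// Sketch: interval-based morphing decision for a coupled core pair.
#include <cstdint>

struct IntervalCounters {
    uint64_t committed_insts;  // instructions retired this interval
    uint64_t cycles;           // elapsed cycles this interval
};

enum class Mode { TwoSmallCores, OneMorphedCore };

// Estimate how much a thread is exploiting its core's resources.
static double ipc(const IntervalCounters& c) {
    return c.cycles ? double(c.committed_insts) / double(c.cycles) : 0.0;
}

// Decide the next interval's configuration for the pair.
Mode decide(const IntervalCounters& a, const IntervalCounters& b) {
    const double HIGH_IPC = 1.5, LOW_IPC = 0.4;   // assumed thresholds
    // If one thread is ILP-hungry and the other barely uses its core,
    // donating the idle core's resources should raise performance/watt.
    if (ipc(a) > HIGH_IPC && ipc(b) < LOW_IPC) return Mode::OneMorphedCore;
    if (ipc(b) > HIGH_IPC && ipc(a) < LOW_IPC) return Mode::OneMorphedCore;
    return Mode::TwoSmallCores;  // otherwise keep the symmetric config
}
```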
Citations: 45
MCFQ: Leveraging Memory-level Parallelism and Application's Cache Friendliness for Efficient Management of Quasi-partitioned Last-level Caches
Dimitris Kaseridis, M. Iqbal, Jeffrey Stuecheli, L. John
To achieve high efficiency and prevent destructive interference among multiple divergent workloads, the last-level cache of chip multiprocessors has to be carefully managed. Previously proposed cache management schemes suffer from inefficient cache capacity utilization, either by focusing solely on reducing the absolute number of cache misses or by allocating cache capacity without taking the applications' memory sharing characteristics into consideration. In this work we propose a quasi-partitioning scheme for last-level caches, MCFQ, that combines the memory-level parallelism, cache friendliness, and interference sensitivity of competing applications to efficiently manage the shared cache capacity. The proposed scheme improves both system throughput and execution fairness, outperforming previous schemes that are oblivious to applications' memory behavior. Our detailed full-system simulations showed an average improvement of 10% in throughput and 9% in fairness over the next-best scheme for a 4-core CMP system.
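The following C++ sketch illustrates the central idea of MLP-aware allocation under stated assumptions: when handing out last-level-cache ways, weight each application's misses by how costly they actually are, since isolated misses (low MLP) stall the core longer than overlapped ones. The greedy loop and cost model are illustrative, not the exact MCFQ algorithm.

```cpp
// Sketch: greedy, MLP-weighted way partitioning across applications.
#include <cstddef>
#include <vector>

struct AppProfile {
    // misses_at_ways[w] = predicted misses if the app gets w ways;
    // must be sized to cover 0..total ways.
    std::vector<double> misses_at_ways;
    double avg_mlp;   // average overlapped misses per stall (>= 1)
};

// Weighted miss cost: with MLP m, roughly m misses share one memory stall.
static double cost(const AppProfile& a, size_t ways) {
    return a.misses_at_ways[ways] / a.avg_mlp;
}

// Hand out ways one at a time, each to the app whose weighted cost drops most.
std::vector<size_t> partition(const std::vector<AppProfile>& apps, size_t total) {
    std::vector<size_t> alloc(apps.size(), 1);        // at least one way each
    for (size_t w = apps.size(); w < total; ++w) {
        size_t best = 0;
        double best_gain = -1.0;
        for (size_t i = 0; i < apps.size(); ++i) {
            double gain = cost(apps[i], alloc[i]) - cost(apps[i], alloc[i] + 1);
            if (gain > best_gain) { best_gain = gain; best = i; }
        }
        ++alloc[best];
    }
    return alloc;
}
```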
Citations: 4
Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control
Bo Wu, E. Zhang, Xipeng Shen
Many dynamic simulation programs contain complex, irregular memory reference patterns and require runtime optimizations to enhance data locality. Current approaches periodically stop the execution of an application to reorder the computation or data based on the current program state, improving data locality for the next period of execution. In this work, we examine the implications that modern heterogeneous Chip Multiprocessor (CMP) architectures impose on this optimization paradigm. We develop three techniques to enhance the optimizations. The first is asynchronous data transformation, which moves data reordering off the critical path through dependence circumvention. The second is a novel data transformation algorithm, named TLayout, designed specifically to take advantage of modern throughput-oriented processors. Together they provide two complementary ways to attack a benefit-overhead dilemma inherent in traditional techniques. Working with a dynamic adaptation scheme, the techniques produce significant performance improvements for a set of dynamic simulation benchmarks.
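To show what moving reordering off the critical path can look like, here is a minimal C++ sketch in which a helper thread builds a better-ordered copy of the data while computation continues, then publishes it atomically at a safe point. The double-buffer-and-swap scheme is an assumption for illustration; the paper's dependence-circumvention machinery is richer.

```cpp
// Sketch: asynchronous data reordering via double buffering and an
// atomic pointer swap.
#include <atomic>
#include <cstddef>
#include <vector>

struct Particles {
    std::vector<float> pos;   // data the simulation threads read
};

std::atomic<Particles*> live;   // buffer the compute threads currently use

// Runs on a helper thread: gather the data into a locality-friendly order
// while the main computation keeps running on the old buffer.
void reorder_async(const std::vector<int>& new_order) {
    Particles* old = live.load(std::memory_order_acquire);
    auto* fresh = new Particles;
    fresh->pos.resize(old->pos.size());
    for (size_t i = 0; i < new_order.size(); ++i)
        fresh->pos[i] = old->pos[new_order[i]];     // gather into new layout
    // Publish at a safe point; compute threads pick it up next iteration.
    live.store(fresh, std::memory_order_release);
    // (The old buffer is reclaimed once no thread can still be reading it.)
}
```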
Citations: 19
Exploiting Task Order Information for Optimizing Sequentially Consistent Java Programs
C. Angerer, T. Gross
Java was designed as a secure language that supports running untrusted code as part of trusted applications. For safety reasons, Java therefore defines a memory model that prevents undefined behavior in multi-threaded programs even if the programs are not correctly synchronized. Because of the potential negative performance impact, the Java designers did not choose a simple and natural memory model, such as sequential consistency, but instead developed a relaxed memory model that gives the compiler more optimization opportunities. As it stands today, however, the relaxed Java Memory Model is not only hard to understand but also unnecessarily complicates reasoning about parallel programs, and it has turned out to be difficult to implement correctly. This paper presents an optimizing compiler for a Java version that has sequential consistency as its memory model. Based on a programming model with explicit happens-before constraints between tasks, we describe a static schedule analysis that computes whether two tasks may be executed in parallel or whether they are ordered. During optimization, the task-ordering information is exploited to reduce the number of volatile memory accesses the compiler must insert to guarantee sequential consistency. The evaluation shows that scheduling information significantly improves the effectiveness of the optimizations. For our set of multi-threaded benchmarks, the fully optimizing compiler removes between 70% and 100% of the volatile memory accesses inserted by the non-optimizing compiler. As a result, the overhead of sequentially consistent Java compared to standard Java is reduced from 136% on average for the unoptimized version to 11% on average for the optimized version. The results indicate that with appropriate optimizations, sequential consistency can be a feasible alternative to the Java Memory Model.
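A small C++ sketch of the ordering query at the heart of such a schedule analysis: given explicit happens-before edges between tasks, two tasks may happen in parallel exactly when neither reaches the other in the happens-before DAG, and only accesses in such pairs need to keep their fences. The graph representation and DFS are illustrative assumptions; the paper performs this analysis statically at compile time.

```cpp
// Sketch: may-happen-in-parallel (MHP) query over a happens-before DAG.
#include <functional>
#include <vector>

struct TaskGraph {
    // hb[t] lists the tasks that are explicitly ordered after task t.
    std::vector<std::vector<int>> hb;
};

// Is there a happens-before path from task `from` to task `to`?
static bool reaches(const TaskGraph& g, int from, int to) {
    std::vector<bool> seen(g.hb.size(), false);
    std::function<bool(int)> dfs = [&](int t) {
        if (t == to) return true;
        if (seen[t]) return false;
        seen[t] = true;
        for (int s : g.hb[t])
            if (dfs(s)) return true;
        return false;
    };
    return dfs(from);
}

// MHP: if the tasks are ordered either way, their accesses need no fences
// for sequential consistency; only MHP pairs keep them.
bool mhp(const TaskGraph& g, int a, int b) {
    return !reaches(g, a, b) && !reaches(g, b, a);
}
```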
Citations: 2
SFMalloc: A Lock-Free and Mostly Synchronization-Free Dynamic Memory Allocator for Manycores
Sangmin Seo, Junghyun Kim, Jaejin Lee
As parallel programming becomes the mainstream due to multicore processors, dynamic memory allocators used in C and C++ can limit the performance of multi-threaded applications if they are not scalable. In this paper, we present a new dynamic memory allocator for multi-threaded applications. The allocator never uses any synchronization for common cases. It uses only lock-free synchronization mechanisms for uncommon cases. Each thread owns a private heap and handles memory requests on that heap. Our allocator is completely synchronization-free when a thread allocates a memory block and deallocates it itself. Synchronization-free means that threads do not communicate with each other at all. On the other hand, if a thread allocates a block and another thread frees it, we use a lock-free stack to atomically add it to the owner thread's heap to avoid the memory blowup problem. Furthermore, our allocator exploits various memory block caching mechanisms to reduce the latency of memory management. Freed blocks or intermediate memory chunks are cached hierarchically in each thread's heap and are used for future memory allocation. We compare the performance and scalability of our allocator to those of well-known existing multi-threaded memory allocators using eight benchmarks. Experimental results on a 48-core AMD system show that our approach achieves better performance than the other allocators for all benchmarks and is highly scalable with a large number of threads.
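The remote-free path described above can be sketched with a standard Treiber-style lock-free stack in C++: a non-owner thread pushes freed blocks, and the owner later drains the whole list with a single atomic exchange. This illustrates the general scheme, not SFMalloc's actual internals.

```cpp
// Sketch: lock-free remote-free stack (push-only plus drain-all, so the
// classic ABA problem of pop-one Treiber stacks does not arise).
#include <atomic>

struct Block {
    Block* next;   // link reused inside freed blocks
};

struct RemoteFreeStack {
    std::atomic<Block*> head{nullptr};

    // Called by any non-owner thread: lock-free push of a freed block.
    void push(Block* b) {
        Block* old = head.load(std::memory_order_relaxed);
        do {
            b->next = old;   // retried with the updated head on CAS failure
        } while (!head.compare_exchange_weak(old, b,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }

    // Called only by the owner thread: take the entire list at once, then
    // reuse the blocks with no further synchronization.
    Block* drain() {
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};
```

The key design point is asymmetry: the common local path touches only private state, and the sole cross-thread interaction is this one lock-free structure.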
Citations: 32
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
Sungpack Hong, Tayo Oguntebi, K. Olukotun
Graphs are a fundamental data representation that has been used extensively in various domains. In graph-based applications, a systematic exploration of the graph, such as a breadth-first search (BFS), often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation, and its advantage grows as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multicore execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a high-end GPU system performed as well as a quad-socket high-end CPU system.
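For reference, a minimal level-synchronous BFS in C++ of the kind such implementations build on: process the current frontier, mark unvisited neighbors, and collect them as the next frontier. The CSR graph layout is a common assumption; the paper's contribution lies in how each level is parallelized and which backend (sequential, multicore, or GPU) is chosen per level.

```cpp
// Sketch: level-synchronous BFS over a CSR graph.
#include <vector>

struct CSRGraph {
    std::vector<int> row;   // row[v]..row[v+1] index into col
    std::vector<int> col;   // concatenated neighbor lists
};

// Returns the BFS level of every vertex (-1 if unreachable from src).
std::vector<int> bfs_levels(const CSRGraph& g, int src) {
    std::vector<int> level(g.row.size() - 1, -1);
    std::vector<int> frontier{src}, next;
    level[src] = 0;
    for (int d = 0; !frontier.empty(); ++d) {
        next.clear();
        for (int v : frontier)                        // the parallelizable loop
            for (int e = g.row[v]; e < g.row[v + 1]; ++e) {
                int u = g.col[e];
                if (level[u] == -1) {                 // first visit wins
                    level[u] = d + 1;
                    next.push_back(u);
                }
            }
        frontier.swap(next);                          // advance one level
    }
    return level;
}
```

Because frontier sizes in low-diameter real-world graphs balloon and then shrink over the levels, a per-level choice of backend (the paper's hybrid method) can beat any single fixed implementation.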
Citations: 290
An Architecture to Enable Lifetime Full Chip Testability in Chip Multiprocessors
Rance Rodrigues, I. Koren, S. Kundu
Technology scaling has led to a tremendous increase in the packing density of transistors. However, these small transistors are susceptible to certain impediments that were not present earlier. Manufacturability suffers because lithography technology trails transistor technology and does not scale well with it. Increased leakage current has reduced the effectiveness of burn-in tests; infant mortality therefore cannot be completely kept in check. Even during operation, reliability is affected by CMOS wear-out mechanisms such as time-dependent dielectric breakdown (TDDB), hot carrier injection (HCI), negative bias temperature instability (NBTI), electromigration (EM), and stress-induced voiding (SIV).
Citations: 0
SPATL: Honey, I Shrunk the Coherence Directory
Hongzhou Zhao, Arrvindh Shriraman, S. Dwarkadas, V. Srinivasan
One of the key scalability challenges of on-chip coherence in a multicore chip is the coherence directory, which provides information on the sharing of cache blocks. Shadow tags that duplicate entire private cache tag arrays are widely used to minimize area overhead, but they require an energy-intensive associative search to obtain the sharing information. Recent research proposed a Tagless directory, which uses Bloom filters to summarize the tags in a cache set. The Tagless directory associates the sharing vector with the Bloom filter buckets to completely eliminate the associative lookup and reduce the directory overhead. However, Tagless still uses a full-map sharing vector to represent the sharing information, so area and energy challenges remain as core counts increase. In this paper, we first show that, due to the regular nature of applications, many Bloom filters essentially replicate the same sharing pattern. We then exploit this pattern commonality and propose SPATL (Sharing-pattern based Tagless Directory). SPATL decouples the sharing patterns from the Bloom filters and eliminates the redundant copies of sharing patterns. SPATL works with both inclusive and noninclusive shared caches and, at 16 cores, provides 34% storage savings over Tagless, previously the most storage-efficient directory. We study multiple strategies to periodically eliminate the false sharing that comes from combining sharing-pattern compression with Tagless, and demonstrate that SPATL can achieve the same level of false sharers as Tagless with 5% extra bandwidth. Finally, we demonstrate that SPATL scales even better than an idealized directory and can support 1024-core chips with less than 1% of the private cache space for data-parallel applications.
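A hedged C++ sketch of the Tagless building block that SPATL starts from: a small Bloom filter summarizes which tags a cache set may contain, so the directory can answer "might this core cache this block?" without shadow-tag lookups. The hash functions and filter size are illustrative; the key property is that false positives cause only spurious probes, never missed sharers.

```cpp
// Sketch: per-set Bloom filter summarizing cache tags.
#include <bitset>
#include <cstdint>

struct TagBloom {
    std::bitset<64> bits;   // one small filter per cache set (assumed size)

    // Two cheap hashes into the 64-bit filter (illustrative choices).
    static unsigned h1(uint64_t tag) {
        return unsigned((tag * 0x9E3779B97F4A7C15ull) >> 58);   // top 6 bits
    }
    static unsigned h2(uint64_t tag) {
        return unsigned((tag ^ (tag >> 17)) & 63);
    }

    // Called when a block fills into this set.
    void insert(uint64_t tag) {
        bits.set(h1(tag));
        bits.set(h2(tag));
    }

    // May report true for tags never inserted (false positive), but a
    // genuine sharer is always reported, which is the property coherence
    // correctness requires.
    bool maybe_contains(uint64_t tag) const {
        return bits.test(h1(tag)) && bits.test(h2(tag));
    }
};
```

SPATL's observation is that many such filters end up paired with identical sharing vectors, so the vectors can be stored once and shared by pointer instead of being replicated per bucket.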
Citations: 56