A DSL for Performance Orchestration
Thiago Teixeira, D. Padua, W. Gropp. DOI: 10.1109/PACT.2017.50
The complexity and diversity of today's computer architectures demand more attention from software developers in order to harness all the available computing power. Furthermore, each modern architecture requires a potentially non-overlapping set of optimizations to attain a high fraction of its nominal peak speed. This raises challenges of performance portability and code maintainability: in particular, how to manage different optimized versions of the same code tailored to different architectures, and how to keep them up to date as new algorithmic features are added. The increasing complexity of architectures and the expanding optimization space tend to make compilers deliver unsatisfactory performance, and the gap between hand-tuned and compiler-generated code has grown dramatically; even advanced optimization flags are not enough to narrow it. On the other hand, optimizing applications manually is very time-consuming, and the developer must understand and interact with many different hardware features of each architecture. Successful research has assisted the programmer in this painful and error-prone process of implementing, optimizing, and porting applications to different architectures; nonetheless, adoption of these works has been mostly restricted to specific domains, such as dense linear algebra, Fourier transforms, and signal processing. We have developed ICE, a framework that decouples the performance-expert role from the application-expert role (separation of concerns). It allows architecture-specific optimizations while keeping the code maintainable over the long term. The framework orchestrates the application of multiple optimization tools to the application's baseline version, which is assumed to contain no architecture- or compiler-specific optimizations, and performs an empirical search to find the best sequence of optimizations and their parameters. The optimizations and the empirical search are directed by a domain-specific language (DSL) kept in an external file. An application's code is often dramatically altered when multiple optimization cases are added for each target architecture; the DSL instead lets the performance expert apply optimizations without disarranging the original code. The DSL has constructs that expose the options of each optimization and generate a search space that can be traversed by different search tools; for instance, conditional statements can specify which optimizations should be carried out for each compiler. The DSL is not only the input of the empirical search but also its output: it can be used to save the best sequence of transformations found in previous searches. The application's code is annotated with unique identifiers that are referenced in the DSL. Currently, source-to-source loop optimizations, algorithm selection, and pragma selection are supported.
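The abstract does not reproduce the DSL itself. As a rough illustration of the empirical-search half of the workflow only (our construction, not ICE's actual interface or DSL), the sketch below times variants of a kernel across one tunable knob, an unroll factor, and reports the timings, standing in for the search a tool like ICE would run over transformation sequences and their parameters.

```cpp
// Minimal sketch of an empirical search over one optimization knob.
// Illustration only; not ICE's DSL or tool interface.
#include <chrono>
#include <cstdio>
#include <vector>

// Baseline kernel with a compile-time unroll factor standing in for one
// parameter a source-to-source transformation might expose.
template <int UNROLL>
double kernel(const std::vector<double>& a) {
    double s = 0.0;
    size_t i = 0;
    for (; i + UNROLL <= a.size(); i += UNROLL)
        for (int u = 0; u < UNROLL; ++u) s += a[i + u];
    for (; i < a.size(); ++i) s += a[i];
    return s;
}

template <int UNROLL>
double time_variant(const std::vector<double>& a) {
    auto t0 = std::chrono::steady_clock::now();
    volatile double sink = kernel<UNROLL>(a);  // volatile: keep the call alive
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::vector<double> a(1 << 22, 1.0);
    // The "search space": one knob with four candidate values.
    std::printf("unroll=1: %fs\n", time_variant<1>(a));
    std::printf("unroll=2: %fs\n", time_variant<2>(a));
    std::printf("unroll=4: %fs\n", time_variant<4>(a));
    std::printf("unroll=8: %fs\n", time_variant<8>(a));
}
```

A real autotuner would traverse a much larger space of transformation sequences and persist the winner, which is the role the abstract assigns to the DSL file.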
POSTER: Improving Datacenter Efficiency Through Partitioning-Aware Scheduling
H. Kasture, Xu Ji, Nosayba El-Sayed, Nathan Beckmann, Xiaosong Ma, Daniel Sánchez. DOI: 10.1109/PACT.2017.43
Datacenter servers often colocate multiple applications to improve utilization and efficiency. However, colocated applications interfere in shared resources, e.g., the last-level cache (LLC) and DRAM bandwidth, causing performance inefficiencies. Prior work has proposed two disjoint approaches to address interference. First, techniques that partition shared resources like the LLC can provide isolation and trade performance among colocated applications within a single node. But partitioning techniques are limited by the fixed resource demands of the applications running on the node. Second, interference-aware schedulers try to find resource-compatible applications and schedule them across nodes to improve performance. But prior schedulers are hampered by the lack of partitioning hardware in conventional multicores, and are forced to take conservative colocation decisions, leaving significant performance on the table. We show that memory-system partitioning and scheduling are complementary, and performing them in a coordinated fashion yields significant benefits. We present Shepherd, a joint scheduler and resource partitioner that seeks to maximize cluster-wide throughput. Shepherd uses detailed application profiling data to partition the shared LLC and to estimate the impact of DRAM bandwidth contention among colocated applications. Shepherd's scheduler leverages this information to colocate applications with complementary resource requirements, improving resource utilization and cluster throughput. We evaluate Shepherd in simulation and on a real cluster with hardware support for cache partitioning. When managing mixes of server and scientific applications, Shepherd improves cluster throughput over an unpartitioned system by 38% on average.
{"title":"POSTER: Improving Datacenter Efficiency Through Partitioning-Aware Scheduling","authors":"H. Kasture, Xu Ji, Nosayba El-Sayed, Nathan Beckmann, Xiaosong Ma, Daniel Sánchez","doi":"10.1109/PACT.2017.43","DOIUrl":"https://doi.org/10.1109/PACT.2017.43","url":null,"abstract":"Datacenter servers often colocate multiple applications to improve utilization and efficiency. However, colocated applications interfere in shared resources, e.g., the last-level cache (LLC) and DRAM bandwidth, causing performance inefficiencies. Prior work has proposed two disjoint approaches to address interference. First, techniques that partition shared resources like the LLC can provide isolation and trade performance among colocated applications within a single node. But partitioning techniques are limited by the fixed resource demands of the applications running on the node. Second, interference-aware schedulers try to find resource-compatible applications and schedule them across nodes to improve performance. But prior schedulers are hampered by the lack of partitioning hardware in conventional multicores, and are forced to take conservative colocation decisions, leaving significant performance on the table. We show that memory-system partitioning and scheduling are complementary, and performing them in a coordinated fashion yields significant benefits. We present Shepherd, a joint scheduler and resource partitioner that seeks to maximize cluster-wide throughput. Shepherd uses detailed application profiling data to partition the shared LLC and to estimate the impact of DRAM bandwidth contention among colocated applications. Shepherd's scheduler leverages this information to colocate applications with complementary resource requirements, improving resource utilization and cluster throughput. We evaluate Shepherd in simulation and on a real cluster with hardware support for cache partitioning. When managing mixes of server and scientific applications, Shepherd improves cluster throughput over an unpartitioned system by 38% on average.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115542725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Improvement via Always-Abort HTM
Joseph Izraelevitz, Lingxiang Xiang, M. Scott. DOI: 10.1109/PACT.2017.16
Several research groups have noted that hardware transactional memory (HTM), even in the case of aborts, can have the side effect of warming up the branch predictor and caches, thereby accelerating subsequent execution. We propose to employ this side effect deliberately, in cases where execution must wait for action in another thread. In doing so, we allow "warm-up" transactions to observe inconsistent state. We must therefore ensure that they never accidentally commit. To that end, we propose that the hardware allow the program to specify, at the start of a transaction, that it should in all cases abort, even if it (accidentally) executes a commit instruction. We discuss several scenarios in which always-abort HTM (AAHTM) can be useful, and present lock and barrier implementations that employ it. We demonstrate the value of these implementations on several real-world applications, obtaining performance improvements of up to 2.5x with almost no programmer effort.
Multilayer Compute Resource Management with Robust Control Theory
Raghavendra Pradyumna Pothukuchi, Sweta Yamini Pothukuchi, P. Voulgaris, J. Torrellas. DOI: 10.1109/PACT.2017.54
Multicores increasingly execute in constrained environments, and are being equipped with controllers for resource management. However, modern multicore systems are structured in multiple complex layers, such as the hardware, OS, and networking layers, each with its own resources. Managing such a system scalably and portably requires that we have a controller in each layer, and that the different controllers coordinate their operation. We present a novel methodology to build coordinated multilevel formal controllers in computing. We consider Robust Control Theory, which focuses on decision making in uncertain environments, and pick the popular Structured Singular Value (SSV) controller. This is the first work to utilize Robust Control Theory for compute resource management. We show the effectiveness of multilevel SSV controllers on a real multicore system.
{"title":"Multilayer Compute Resource Management with Robust Control Theory","authors":"Raghavendra Pradyumna Pothukuchi, Sweta Yamini Pothukuchi, P. Voulgaris, J. Torrellas","doi":"10.1109/PACT.2017.54","DOIUrl":"https://doi.org/10.1109/PACT.2017.54","url":null,"abstract":"Multicores increasingly execute in constrained environments, and are being equipped with controllers for resource management. However, modern multicore systems are structured in multiple complex layers, such as the hardware, OS, and networking layers, each with its own resources. Managing such a system scalably and portably requires that we have a controller in each layer, and that the different controllers coordinate their operation. We present a novel methodology to build coordinated multilevel formal controllers in computing. We consider Robust Control Theory, which focuses on decision making in uncertain environments, and pick the popular Structured Singular Value (SSV) controller. This is the first work to utilize Robust Control Theory for compute resource management. We show the effectiveness of multilevel SSV controllers on a real multicore system.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129015784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DrMP: Mixed Precision-Aware DRAM for High Performance Approximate and Precise Computing
Xianwei Zhang, Youtao Zhang, B. Childers, Jun Yang. DOI: 10.1109/PACT.2017.34
Recent studies showed that DRAM restore time degrades as technology scales, which imposes large performance and energy overheads. This problem, prolonged restore time (PRT), has been identified by the DRAM industry as one of three major scaling challenges. This paper proposes DrMP, a novel fine-grained precision-aware DRAM restore scheduling approach, to mitigate PRT. The approach exploits process variations (PVs) within and across DRAM rows to save data with mixed precision. The paper describes three variants of the approach: DrMP-A, DrMP-P, and DrMP-U. DrMP-A supports approximate computing by mapping important data bits to fast row segments to reduce restore time for improved performance at a low application error rate. DrMP-P pairs memory rows together to reduce the average restore time for precise computing. DrMP-U combines DrMP-A and DrMP-P to better trade off performance, energy consumption, and computation precision. Our experimental results show that, on average, DrMP achieves a 20% performance improvement and a 15% energy reduction over a precision-oblivious baseline. Further, DrMP achieves an error rate of less than 1% at the application level for a suite of benchmarks, including applications that exhibit unacceptable error rates under simple approximation that does not differentiate the importance of different bits.
{"title":"DrMP: Mixed Precision-Aware DRAM for High Performance Approximate and Precise Computing","authors":"Xianwei Zhang, Youtao Zhang, B. Childers, Jun Yang","doi":"10.1109/PACT.2017.34","DOIUrl":"https://doi.org/10.1109/PACT.2017.34","url":null,"abstract":"Recent studies showed that DRAM restore time degrades as technology scales, which imposes large performance and energy overheads. This problem, prolonged restore time (PRT), has been identified by the DRAM industry as one of three major scaling challenges.This paper proposes DrMP, a novel fine-grained precision-aware DRAM restore scheduling approach, to mitigate PRT. The approach exploits process variations (PVs) within and across DRAM rows to save data with mixed precision. The paper describes three variants of the approach: DrMP-A, DrMP-P, and DrMP-U. DrMP-A supports approximate computing by mapping important data bits to fast row segments to reduce restore time for improved performance at a low application error rate. DrMP-P pairs memory rows together to reduce the average restore time for precise computing. DrMP-U combines DrMP-A and DrMP-P to better trade performance, energy consumption, and computation precision. Our experimental results show that, on average, DrMP achieves 20% performance improvement and 15% energy reduction over a precision-oblivious baseline. Further, DrMP achieves an error rate less than 1% at the application level for a suite of benchmarks, including applications that exhibit unacceptable error rates under simple approximation that does not differentiate the importance of different bits.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"206 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115290078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: BigBus: A Scalable Optical Interconnect
E. Peter, Janibul Bashir, S. Sarangi. DOI: 10.1109/PACT.2017.18
This paper presents BigBus, a novel on-chip photonic network for a 1024-node system. The crux of the idea is to segment the entire system into smaller clusters of nodes and adopt a hybrid strategy for each segment that includes conventional laser modulation as well as a novel technique for sharing power across nodes dynamically. We represent energy internally as tokens, where one token allows a node to send a message to any other node in its cluster. We allow optical stations to arbitrate for tokens, and at the global level we predict the number of token equivalents of power that the off-chip laser needs to generate.
Efficient Checkpointing of Loop-Based Codes for Non-volatile Main Memory
Hussein Elnawawy, Mohammad A. Alshboul, James Tuck, Yan Solihin. DOI: 10.1109/PACT.2017.58
Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited, as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this paper, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we log only enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, reducing execution time overheads and improving NVMM write endurance at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model built on gem5 that supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% with logging and 207% with checkpointing. Furthermore, recompute adds only 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing.
{"title":"Efficient Checkpointing of Loop-Based Codes for Non-volatile Main Memory","authors":"Hussein Elnawawy, Mohammad A. Alshboul, James Tuck, Yan Solihin","doi":"10.1109/PACT.2017.58","DOIUrl":"https://doi.org/10.1109/PACT.2017.58","url":null,"abstract":"Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance.In this paper, we propose a novel recompute-based failure safety approach, and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123147895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nexus: A New Approach to Replication in Distributed Shared Caches
Po-An Tsai, Nathan Beckmann, Daniel Sánchez. DOI: 10.1109/PACT.2017.42
Last-level caches are increasingly distributed, consisting of many small banks. To perform well, most accesses must be served by banks near requesting cores. An attractive approach is to replicate read-only data so that a copy is available nearby. But replication introduces a delicate tradeoff between capacity and latency: too little replication forces cores to access faraway banks, while too much replication wastes cache space and causes excessive off-chip misses. Workloads vary widely in their desired amount of replication, demanding an adaptive approach. Prior adaptive replication techniques only replicate data in each tile's local bank, so they focus on selecting which data to replicate. Unfortunately, data that is not replicated still incurs a full network traversal, limiting the performance of these techniques. We argue that a better strategy is to let cores share replicas and that adaptive schemes should focus on selecting how much to replicate (i.e., how many replicas to have across the chip). This idea fully exploits the latency-capacity tradeoff, achieving qualitatively higher performance than prior adaptive replication techniques. It can be applied to many prior cache organizations, and we demonstrate it on two: Nexus-R extends R-NUCA, and Nexus-J extends Jigsaw. We evaluate Nexus on HPC and server workloads running on a 144-core chip, where it outperforms prior adaptive replication schemes and improves performance by up to 90% and by 23% on average across all workloads sensitive to replication.
POSTER: Improving NUMA System Efficiency with a Utilization-Based Co-scheduling
Younghyun Cho, Camilo A. Celis Guzman, Bernhard Egger. DOI: 10.1109/PACT.2017.27
This work proposes a co-scheduling technique for co-located parallel applications on Non-Uniform Memory Access (NUMA) multi-socket multi-core platforms. The technique allocates core resources for running parallel applications such that both the utilization of the memory controllers and the CPU cores are maximized. Utilization is predicted using an online performance prediction model based on queuing systems. At runtime, the core allocation is periodically re-evaluated and cores are re-assigned to executing applications. Experimental results show that the proposed co-scheduling technique is able to execute co-located parallel applications in significantly less total execution time than the default Linux scheduler and a conventional scalability-based scheduler.
{"title":"POSTER: Improving NUMA System Efficiency with a Utilization-Based Co-scheduling","authors":"Younghyun Cho, Camilo A. Celis Guzman, Bernhard Egger","doi":"10.1109/PACT.2017.27","DOIUrl":"https://doi.org/10.1109/PACT.2017.27","url":null,"abstract":"This work proposes a co-scheduling technique for co-located parallel applications on Non-Uniform Memory Access (NUMA) multi-socket multi-core platforms. The technique allocates core resources for running parallel applications such that both the utilization of the memory controllers and the CPU cores are maximized. Utilization is predicted using an online performance prediction model based on queuing systems. At runtime, the core allocation is periodically re-evaluated and cores are re-assigned to executing applications. Experimental results show that the proposed co-scheduling technique is able to execute co-located parallel applications in significantly less total execution time than the default Linux scheduler and a conventional scalability-based scheduler.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116770126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: The Liberation Day of Nondeterministic Programs
E. A. Deiana, Vincent St-Amour, P. Dinda, N. Hardavellas, Simone Campanoni. DOI: 10.1109/PACT.2017.26
The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.
{"title":"POSTER: The Liberation Day of Nondeterministic Programs","authors":"E. A. Deiana, Vincent St-Amour, P. Dinda, N. Hardavellas, Simone Campanoni","doi":"10.1109/PACT.2017.26","DOIUrl":"https://doi.org/10.1109/PACT.2017.26","url":null,"abstract":"The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call \"state dependences\" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of \"state dependences\". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124014259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}