2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA): Latest Papers

Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001201

M. Yoon, Keunsoo Kim, Sangpil Lee, W. Ro, M. Annavaram

Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive amount of processing resources. However, thread concurrency in GPUs can be diminished either due to a shortage of thread scheduling structures (scheduling limit), such as available program counters and single instruction multiple thread stacks, or due to a shortage of on-chip memory (capacity limit), such as register file and shared memory. Our evaluations show that in practice concurrency in many general purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing the scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit, while ignoring the scheduling limit. However, to reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long latency stall, the active CTA is context switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit, which obviates the need to save and restore large amounts of CTA state. Thus VT significantly reduces the performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
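The distinction between the scheduling limit and the capacity limit can be made concrete with a small calculation. The sketch below is only an illustration of the idea in the abstract, not the paper's hardware design: the class names, field names, and all resource numbers are hypothetical, and a real GPU has further constraints that are ignored here.

```python
from dataclasses import dataclass


@dataclass
class SMLimits:
    """Per-SM resources (all values in the example are made up)."""
    max_ctas: int          # scheduling limit: CTA slots
    max_warps: int         # scheduling limit: warp slots (PCs, SIMT stacks)
    registers: int         # capacity limit: register file entries
    shared_mem_bytes: int  # capacity limit: shared memory


@dataclass
class Kernel:
    warps_per_cta: int
    regs_per_thread: int
    shared_mem_per_cta: int
    threads_per_warp: int = 32


def scheduling_limit(sm: SMLimits, k: Kernel) -> int:
    """CTAs that fit within the scheduling structures."""
    return min(sm.max_ctas, sm.max_warps // k.warps_per_cta)


def capacity_limit(sm: SMLimits, k: Kernel) -> int:
    """CTAs that fit within the register file and shared memory."""
    regs_per_cta = k.regs_per_thread * k.threads_per_warp * k.warps_per_cta
    limit = sm.registers // regs_per_cta
    if k.shared_mem_per_cta > 0:
        limit = min(limit, sm.shared_mem_bytes // k.shared_mem_per_cta)
    return limit


def vt_residency(sm: SMLimits, k: Kernel) -> tuple[int, int]:
    """Return (active, inactive) CTA counts under the VT idea:
    residency follows the capacity limit, while the number of
    active CTAs still respects the scheduling limit."""
    resident = capacity_limit(sm, k)
    active = min(resident, scheduling_limit(sm, k))
    return active, resident - active


sm = SMLimits(max_ctas=8, max_warps=48, registers=32768,
              shared_mem_bytes=48 * 1024)
kernel = Kernel(warps_per_cta=2, regs_per_thread=16, shared_mem_per_cta=1024)
print(vt_residency(sm, kernel))
```

With these hypothetical numbers a conventional scheduler would keep 8 CTAs resident, whereas the sketch keeps 32 resident, 8 of them active and 24 inactive and ready to be swapped in when an active CTA stalls.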

Pages: 609-621

Citations: 50

Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001143

Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, S. Swanson

In high performance computing systems, object deserialization can become a surprisingly important bottleneck: in our test, a set of general-purpose, highly parallelized applications spends 64% of total execution time deserializing data into objects. This paper presents the Morpheus model, which allows applications to move such computations to a storage device. We use this model to deserialize data into application objects inside storage devices, rather than in the host CPU. Using the Morpheus model for object deserialization avoids unnecessary system overheads, frees up scarce CPU and main memory resources for compute-intensive workloads, saves I/O bandwidth, and reduces power consumption. In heterogeneous, coprocessor-equipped systems, Morpheus allows application objects to be sent directly from a storage device to a coprocessor (e.g., a GPU) by peer-to-peer transfer, further improving application performance as well as reducing CPU and main memory utilization. This paper implements Morpheus-SSD, an SSD supporting the Morpheus model. Morpheus-SSD improves the performance of object deserialization by 1.66×, reduces power consumption by 7%, uses 42% less energy, and speeds up the total execution time by 1.32×. By using NVMe-P2P, which realizes peer-to-peer communication between Morpheus-SSD and a GPU, Morpheus-SSD can speed up the total execution time by 1.39× in a heterogeneous computing platform.

Pages: 53-65

Citations: 66

Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001208

Hari Cherupalli, Rakesh Kumar, J. Sartori

Many emerging applications such as the internet of things, wearables, and sensor networks have ultra-low-power requirements. At the same time, cost and programmability considerations dictate that many of these applications will be powered by general purpose embedded microprocessors and microcontrollers, not ASICs. In this paper, we exploit a new opportunity for improving energy efficiency in the ultra-low-power processors expected to drive these applications: dynamic timing slack. Dynamic timing slack exists when an embedded software application executed on a processor does not exercise the processor's static critical paths. In such scenarios, the longest path exercised by the application has additional timing slack which can be exploited for power savings at no performance cost by scaling down the processor's voltage at the same frequency until the longest exercised paths just meet timing constraints. Paths that cannot be exercised by an application can safely be allowed to violate timing constraints. We show that dynamic timing slack exists for many ultra-low-power applications and that exploiting dynamic timing slack can result in significant power savings for any ultra-low-power processor. We also present an automated methodology for identifying dynamic timing slack and selecting a safe operating point for a processor and a particular embedded software application. Our approach for identifying and exploiting dynamic timing slack is non-speculative, requires no programmer intervention and little or no hardware support, and demonstrates potential power savings of up to 32%, 25% on average, over a range of embedded applications running on a common ultra-low-power processor, at no performance cost.
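The voltage-selection step described in the abstract can be illustrated with a toy model. The sketch below is not the paper's methodology: it assumes an alpha-power-law delay model with made-up threshold voltage and exponent, a hand-written list of path delays, and a hypothetical set of candidate voltages. It only shows the principle that the lowest safe voltage is set by the longest exercised path rather than the static critical path.

```python
def delay_scale(v, v_nominal=1.0, v_threshold=0.3, alpha=1.3):
    """Gate delay at voltage v relative to the nominal voltage,
    using an alpha-power-law model (parameters are assumptions)."""
    def delay(x):
        return x / (x - v_threshold) ** alpha
    return delay(v) / delay(v_nominal)


def lowest_safe_voltage(path_delays_ns, exercised, clock_ns, voltages):
    """Pick the lowest candidate voltage at which every path that the
    application actually exercises still meets the clock period.
    Unexercised paths are allowed to violate timing."""
    longest = max(d for d, used in zip(path_delays_ns, exercised) if used)
    safe = [v for v in voltages if longest * delay_scale(v) <= clock_ns]
    return min(safe)


# Hypothetical design: the 10 ns static critical path is never exercised
# by this application, so the 7.5 ns path sets the operating point.
path_delays_ns = [10.0, 7.5, 6.0]   # delays at nominal voltage
exercised = [False, True, True]
voltages = [1.0, 0.9, 0.8, 0.7, 0.6]

print(lowest_safe_voltage(path_delays_ns, exercised, 10.0, voltages))
```

If the static critical path were exercised as well, the same function would return the nominal voltage, since that path has no slack to trade away.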

Pages: 671-681

Citations: 36

XED: Exposing On-Die Error Detection Information for Strong Memory Reliability

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001174

Prashant J. Nair, Vilas Sridharan, Moinuddin K. Qureshi

Large-granularity memory failures continue to be a critical impediment to system reliability. To make matters worse, as DRAM scales to smaller nodes, the frequency of unreliable bits in DRAM chips continues to increase. To mitigate such scaling-related failures, memory vendors are planning to equip existing DRAM chips with On-Die ECC. For maintaining compatibility with memory standards, On-Die ECC is kept invisible from the memory controller. This paper explores how to design high reliability memory systems in the presence of On-Die ECC. We show that if On-Die ECC is not exposed to the memory system, having a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefits compared to an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM. To this end, we propose eXposed On-Die Error Detection (XED), which exposes the On-Die error detection information without requiring changes to the memory standards or consuming bandwidth overheads. When the On-Die ECC detects an error, XED transmits a pre-defined “catch-word” instead of the corrected data value. On receiving the catch-word, the memory controller uses the parity stored in the ninth chip of the ECC-DIMM to correct the faulty chip (similar to RAID-3). Our studies show that XED provides Chipkill-level reliability (172× higher than SECDED), while incurring negligible overheads, with a 21% lower execution time than Chipkill. We also show that XED can enable Chipkill systems to provide Double-Chipkill level reliability while avoiding the associated storage, performance, and power overheads.
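The catch-word and parity mechanism can be sketched in a few lines. This is a simplified model under stated assumptions, not the paper's implementation: each chip is represented by a single integer word, the ninth chip holds the XOR parity of the eight data words, and the `CATCH_WORD` value is a hypothetical placeholder. The sketch also ignores the case where legitimate data happens to equal the catch-word.

```python
from functools import reduce
from operator import xor

CATCH_WORD = 0xDEADBEEFCAFEF00D  # hypothetical pre-defined value


def write_line(data_words):
    """Store eight data words plus an XOR parity word (the ninth chip)."""
    parity = reduce(xor, data_words, 0)
    return list(data_words) + [parity]


def read_line(chip_outputs):
    """Memory-controller side: if one chip sent the catch-word, rebuild
    its word from the other eight chips, in the style of RAID-3."""
    flagged = [i for i, w in enumerate(chip_outputs) if w == CATCH_WORD]
    if not flagged:
        return list(chip_outputs[:-1])
    if len(flagged) > 1:
        raise ValueError("more than one chip flagged; parity cannot repair")
    bad = flagged[0]
    # XOR of all remaining words (data and parity) yields the missing one.
    recovered = reduce(
        xor, (w for i, w in enumerate(chip_outputs) if i != bad), 0)
    repaired = list(chip_outputs)
    repaired[bad] = recovered
    return repaired[:-1]


data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
stored = write_line(data)
stored[3] = CATCH_WORD  # chip 3 detected an on-die error
print(read_line(stored) == data)
```

The key point is that the catch-word tells the controller which chip is faulty, so a single parity chip is enough for correction rather than only detection.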

Pages: 341-353

Citations: 56

Production-Run Software Failure Diagnosis via Adaptive Communication Tracking

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001175

Mohammad Mejbah Ul Alam, A. Muzahid

Software failure diagnosis techniques work either by sampling some events at production-run time or by using some bug detection algorithms. Some of the techniques require the failure to be reproduced multiple times. Those that do not require this are not adaptive enough when the execution platform, environment, or code changes. We propose ACT, a diagnosis technique for production-run failures that uses the machine intelligence of neural hardware. ACT learns some invariants (e.g., data communication invariants) on-the-fly using the neural hardware and records any potential violation of them. Since ACT can learn invariants on-the-fly, it can adapt to any change in execution setting or code. Since it records only the potentially violated invariants, the postprocessing phase can pinpoint the root cause fairly accurately without needing to observe the failure again. ACT works seamlessly for many sequential and concurrency bugs. The paper provides a detailed design and implementation of ACT in a typical multiprocessor system. It uses a three-stage pipeline for partially configurable one-hidden-layer neural networks. We have evaluated ACT on a variety of programs from popular benchmarks as well as open source programs. ACT diagnoses failures caused by 16 bugs from these programs with accurate ranking. Compared to existing learning and sampling based approaches, ACT has better diagnostic ability. For the default configuration, ACT has an average execution overhead of 8.2%.

Pages: 354-366

Citations: 7

Evaluation of an Analog Accelerator for Linear Algebra

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001197

Yipeng Huang, Ning Guo, Mingoo Seok, Y. Tsividis, S. Sethumadhavan

Due to the end of supply voltage scaling and the increasing percentage of dark silicon in modern integrated circuits, researchers are looking for new scalable ways to get useful computation from existing silicon technology. In this paper we present a reconfigurable analog accelerator for solving systems of linear equations. Commonly perceived downsides of analog computing, such as low precision and accuracy, limited problem sizes, and difficulty in programming are all compensated for using methods we discuss. Based on a prototyped analog accelerator chip we compare the performance and energy consumption of the analog solver against an efficient digital algorithm running on a CPU, and find that the analog accelerator approach may be an order of magnitude faster and provide one third energy savings, depending on the accelerator design. Due to the speed and efficiency of linear algebra algorithms running on digital computers, an analog accelerator that matches digital performance needs a large silicon footprint. Finally, we conclude that problem classes outside of systems of linear equations may hold more promise for analog acceleration.

Pages: 570-582

Citations: 21

Back to the Future: Leveraging Belady's Algorithm for Improved Cache Replacement

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001146

Akanksha Jain, Calvin Lin

Belady's algorithm is optimal but infeasible because it requires knowledge of the future. This paper explains how a cache replacement algorithm can nonetheless learn from Belady's algorithm by applying it to past cache accesses to inform future cache replacement decisions. We show that the implementation is surprisingly efficient, as we introduce a new method of efficiently simulating Belady's behavior, and we use known sampling techniques to compactly represent the long history information that is needed for high accuracy. For a 2MB LLC, our solution uses a 16KB hardware budget (excluding replacement state in the tag array). When applied to a memory-intensive subset of the SPEC 2006 CPU benchmarks, our solution improves performance over LRU by 8.4%, as opposed to 6.2% for the previous state-of-the-art. For a 4-core system with a shared 8MB LLC, our solution improves performance by 15.0%, compared to 12.0% for the previous state-of-the-art.
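To make the idea of applying Belady's algorithm to past accesses concrete, the sketch below replays a recorded trace through a textbook version of Belady's policy (evict the line whose next use is farthest in the future) and through LRU for comparison. It assumes a small fully associative cache and is not the efficient simulation method the paper introduces; the function names and the trace are illustrative only.

```python
from collections import OrderedDict


def belady_hits(trace, capacity):
    """Replay a past trace with Belady's policy; return a hit/miss list."""
    # Precompute, for every access, the index of the next access
    # to the same line (infinity if it is never reused).
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # line -> index of its next use
    hits = []
    for i, line in enumerate(trace):
        if line in cache:
            hits.append(True)
        else:
            hits.append(False)
            if len(cache) >= capacity:
                # Evict the line reused farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[line] = next_use[i]
    return hits


def lru_hits(trace, capacity):
    """Replay the same trace with LRU replacement."""
    cache = OrderedDict()
    hits = []
    for line in trace:
        if line in cache:
            cache.move_to_end(line)
            hits.append(True)
        else:
            hits.append(False)
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop least recently used
            cache[line] = None
    return hits


trace = list("ABCABCABC")  # cyclic pattern larger than the cache
print(sum(belady_hits(trace, 2)), sum(lru_hits(trace, 2)))
```

The per-access hit/miss labels from the replay are the kind of signal a predictor could be trained on: accesses that would have hit under the optimal policy are worth keeping, and the rest are not.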

Citations: 152

ActivePointers: A Case for Software Address Translation on GPUs

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001200

Sagi Shahar, Shai Bergman, M. Silberstein

Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in large-scale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs. We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp. We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. The use of active pointers adds only up to 1% to the application's runtime, while enabling speedups of up to 3.9× over a combined CPU+GPU implementation and 2.6× over a 12-core CPU-only implementation which uses AVX vector instructions.

Pages : 596-608

Citations: 15

Automatic Generation of Efficient Accelerators for Reconfigurable Hardware

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001150

D. Koeplinger, R. Prabhakar, Yaqi Zhang, Christina Delimitrou, C. Kozyrakis, K. Olukotun

Acceleration in the form of customized datapaths offers large performance and energy improvements over general-purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for use in implementing application-specific accelerators, thereby increasing the importance of having good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid and automatic design space exploration. We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6-core processor and show speedups of up to 16.7×.
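
The estimate-then-explore loop described in this abstract can be sketched as an exhaustive sweep over tile sizes, parallelization factors, and an optional pipelining flag, keeping the fastest design that fits an area budget. The cost model below is invented for illustration: the paper uses template-level models and neural networks, whereas `estimate` here is a made-up closed-form stand-in, and every constant and name in it is an assumption.

```python
from itertools import product


def estimate(tile, par, pipelined, n=1024):
    """Toy analytical model (assumed, not from the paper).

    Returns (area_units, runtime_cycles) for one design point.
    """
    area = 10 * par + tile // 8 + (20 if pipelined else 0)
    compute = (n * n) // par          # cycles spent computing
    memory = (n * n) // tile * 4      # cycles spent on off-chip transfers
    # Coarse-grained pipelining overlaps compute with memory traffic;
    # without it the two phases run back to back.
    runtime = max(compute, memory) if pipelined else compute + memory
    return area, runtime


def explore(area_budget, tiles=(16, 64, 256), pars=(1, 4, 16),
            pipes=(False, True)):
    """Return (runtime, area, tile, par, pipelined) of the fastest design
    that fits the area budget, or None if no design fits."""
    best = None
    for tile, par, pipe in product(tiles, pars, pipes):
        area, runtime = estimate(tile, par, pipe)
        if area > area_budget:
            continue                  # prune designs that do not fit
        if best is None or runtime < best[0]:
            best = (runtime, area, tile, par, pipe)
    return best


best_small = explore(area_budget=100)
no_fit = explore(area_budget=5)
```

Because each estimate is a handful of arithmetic operations rather than a synthesis run, sweeping the whole space is cheap. That is the property the abstract's reported 279 to 6533 times speedup over a commercial high-level synthesis tool relies on.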

Pages : 115-127

Citations: 98

PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2016-06-18 DOI: 10.1145/3007787.3001140

Ping Chi, Shuangchen Li, Conglei Xu, Zhang Tao, Jishen Zhao, Yongpan Liu, Yu Wang, Yuan Xie

Processing-in-memory (PIM) is a promising solution to address the “memory wall” challenges for future computer systems. Previously proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves performance by ~2360× and energy consumption by ~895×, across the evaluated machine learning benchmarks.
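
The reason a crossbar suits matrix-vector multiplication can be shown with a minimal numerical sketch. Each cell stores a weight as a conductance G[i][j], input voltages V[i] drive the rows, and the current collected on column j is the sum over i of V[i] * G[i][j], so one analog read yields a full dot product per column. The function name `crossbar_mvm` and the integer values are illustrative assumptions. The sketch models only the ideal arithmetic, not PRIME's circuits, precision handling, or memory/accelerator mode switching.

```python
def crossbar_mvm(voltages, conductances):
    """Ideal crossbar read: column currents I[j] = sum_i V[i] * G[i][j].

    voltages      -- list of row input values, length = number of rows
    conductances  -- list of rows, each a list of per-column cell values
    """
    rows = len(conductances)
    cols = len(conductances[0])
    if len(voltages) != rows:
        raise ValueError("one input voltage is required per crossbar row")
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


# A 3x2 weight matrix stored as conductances, driven by three inputs.
weights = [[1, 0],
           [0, 1],
           [2, 2]]
inputs = [1, 2, 3]
currents = crossbar_mvm(inputs, weights)
```

In a neural network layer, `inputs` would be the activations and each column of `weights` one neuron's weight vector. The column currents are then the pre-activation sums.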

Pages : 27-39

Citations: 1189