{"title":"内存数据流处理器","authors":"Daichi Fujiki, S. Mahlke, R. Das","doi":"10.1109/PACT.2017.53","DOIUrl":null,"url":null,"abstract":"Recent development of Non-Volatile Memories (NVMs) has opened up a new horizon for in-memory computing. By re-purposing memory structures, certain NVMs have been shown to have in-situ analog computation capability. For example, resistive memories (ReRAMs) store the data in the form of resistance of titanium oxides, and by injecting voltage into the word line and sensing the resultant current on the bit-line, we obtain the dot-product of the input voltages and cell conductances using Kirchhoff's law. Recent works have explored the design space of ReRAM based accelerators for machine learning algorithms by leveraging this dot-product functionality [2]. These ReRAM based accelerators exploit the massive parallelism and relaxed precision requirements, to provide orders of magnitude improvement when compared to current CPU/GPU architectures and custom ASICs, inspite of their high read/write latency. Despite the significant performance gain offered by computational NVMs, previous works have relied on manual mapping of workloads to the memory arrays, making it difficult to configure it for new workloads. We combat this problem by proposing a programmable inmemory processor architecture and programming framework. The architecture consists of memory arrays grouped in tiles, and a custom interconnect to facilitate communication between the arrays. Each array acts as unit of storage as well as processing element. The proposed in-memory processor architecture is simple. The key challenge is developing a programming framework and a rich ISA which can allow diverse data-parallel programs to leverage the underlying computational efficiency. The efficiency of the proposed in-memory processor comes from two sources. First, massive parallelism. NVMs are composed of several thousands of arrays. Each of these arrays are transformed into ALUs which can compute concurrently. Second, reduction in data movement, by avoiding shuffling of data between memory and processor cores. Our goal is to establish the programming semantics and execution models to expose the above benefits of ReRAM computing to general purpose data parallel programs. The proposed programming framework seeks to expose the underling parallelism in the hardware by merging the concepts of data-flow and vector processing (or SIMD). Data-flow explicitly exposes the Instruction Level Parallelism (ILP) in the programs, while vector processing exposes the Data Level Parallelism (DLP) in programs. Google's TensorFlow [1] is a popular programming model for machine learning. We observe that TensorFlow's programming semantics is a perfect marriage of data-flow and vector-processing. Thus, our proposed programming framework starts by requiring the programmers to write programs in TensorFlow. We develop a TensorFlow compiler that generates binary code for our in-memory data-flow processor. The TensorFlow (TF) programs are essentially Data Flow Graphs (DFG) where each operator node can have tensors as operands. A DFG which operates on one element of a vector is referred to as a module by the compiler. The compiler transforms input DFG into a collection data-parallel modules. Modules which operate on same vectors belong to an Instruction Block (IB), and are run concurrently on memory arrays. 
We combat this problem by proposing a programmable in-memory processor architecture and programming framework. The architecture consists of memory arrays grouped into tiles, with a custom interconnect to facilitate communication between the arrays. Each array acts as a unit of storage as well as a processing element. The proposed in-memory processor architecture is simple; the key challenge is developing a programming framework and a rich ISA that allow diverse data-parallel programs to leverage the underlying computational efficiency. The efficiency of the proposed in-memory processor comes from two sources. First, massive parallelism: NVMs are composed of several thousand arrays, each of which is transformed into an ALU that can compute concurrently. Second, reduction in data movement, achieved by avoiding the shuffling of data between memory and processor cores.

Our goal is to establish programming semantics and execution models that expose the above benefits of ReRAM computing to general-purpose data-parallel programs. The proposed programming framework seeks to expose the underlying parallelism in the hardware by merging the concepts of data-flow and vector processing (SIMD). Data-flow explicitly exposes the Instruction-Level Parallelism (ILP) in programs, while vector processing exposes the Data-Level Parallelism (DLP). Google's TensorFlow [1] is a popular programming model for machine learning, and we observe that its programming semantics is a perfect marriage of data-flow and vector processing. Thus, our proposed programming framework starts by requiring programmers to write programs in TensorFlow, and we develop a TensorFlow compiler that generates binary code for our in-memory data-flow processor. TensorFlow (TF) programs are essentially Data Flow Graphs (DFGs) in which each operator node can take tensors as operands. A DFG that operates on one element of a vector is referred to as a module by the compiler. The compiler transforms the input DFG into a collection of data-parallel modules; modules that operate on the same vectors belong to an Instruction Block (IB) and are run concurrently on the memory arrays.
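As a concrete example of the kind of input the framework expects, the following TensorFlow 1.x-style graph (the graph-mode API in use when this work appeared) builds a small element-wise vector computation. Conceptually, the per-element slice of this graph is what the compiler calls a module, and the identical modules over the same input vectors form one Instruction Block that can be dispatched concurrently to the memory arrays. The toy computation and names are illustrative, not taken from the paper.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph-mode API

# A tiny data-flow graph over 1024-element vectors: y = relu(a * b + c).
a = tf.placeholder(tf.float32, shape=[1024], name="a")
b = tf.placeholder(tf.float32, shape=[1024], name="b")
c = tf.placeholder(tf.float32, shape=[1024], name="c")

prod = tf.multiply(a, b, name="prod")  # one operator node of the DFG
acc = tf.add(prod, c, name="acc")      # consumes the 'prod' node's output
y = tf.nn.relu(acc, name="y")

# The scalar slice (mul -> add -> relu) is one "module"; the 1024 identical
# modules over a, b, c form an Instruction Block, run concurrently on arrays.
with tf.Session() as sess:
    feed = {a: np.ones(1024, np.float32),
            b: np.full(1024, 2.0, np.float32),
            c: np.full(1024, -1.0, np.float32)}
    print(sess.run(y, feed_dict=feed)[:4])  # [1. 1. 1. 1.]
```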
Our compiler explores several interesting optimizations, such as unrolling high-dimensional tensors, maximizing ILP within a module, pipelining memory reads and writes, and minimizing communication between arrays.

To create a programmable in-memory processor, we argue that a variety of computation primitives need to be implemented by exploiting the analog computation capability of the ReRAM arrays. Thus, we develop a general-purpose ISA and design a memory array architecture that can support diverse operations. For instance, we show how to efficiently implement complex operations (such as division, transcendental functions, and element-wise vector multiplication) using the analog primitives of the ReRAM memory arrays. Furthermore, we discuss efficient network/memory co-design for reduction operations and scatter/gather operations.
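The abstract does not detail how each complex operation is decomposed onto the analog primitives, but division illustrates the idea: with only multiply and add/subtract primitives available, a reciprocal can be refined iteratively and then multiplied by the numerator. The sketch below uses the standard Newton-Raphson reciprocal iteration purely as an illustration of such a decomposition under assumed operand ranges; it is not claimed to be the authors' implementation.

```python
import numpy as np

def reciprocal_newton(d, iterations=6):
    """Approximate 1/d element-wise using only multiplies and subtractions.

    Newton-Raphson iteration for f(x) = 1/x - d:
        x_{k+1} = x_k * (2 - d * x_k)
    The fixed initial guess converges for d roughly in (0, 2.2); a real
    implementation would first normalize d into a known range.
    """
    d = np.asarray(d, dtype=float)
    x = np.full_like(d, 0.9)       # crude initial guess
    for _ in range(iterations):
        x = x * (2.0 - d * x)      # only mul/sub: maps onto the array primitives
    return x

def divide(a, b, iterations=6):
    """Element-wise a / b built from multiply plus an iterative reciprocal."""
    return np.asarray(a, dtype=float) * reciprocal_newton(b, iterations)

print(divide([1.0, 3.0, 5.0], [2.0, 1.5, 1.25]))  # ~[0.5, 2.0, 4.0]
```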
In summary, in this poster we will present the programming framework, compiler, ISA, and architecture of our proposed general-purpose in-memory data-flow processor built out of resistive compute memories. Figure X shows our overall framework. We will also present experimental results across micro-benchmarks and real-world benchmarks from PARSEC and Rodinia. Initial results demonstrate ∼800x and ∼100x speedups compared to multi-core CPU and GPU execution, respectively.