{"title":"内存数据流处理器","authors":"Daichi Fujiki, S. Mahlke, R. Das","doi":"10.1109/PACT.2017.53","DOIUrl":null,"url":null,"abstract":"Recent development of Non-Volatile Memories (NVMs) has opened up a new horizon for in-memory computing. By re-purposing memory structures, certain NVMs have been shown to have in-situ analog computation capability. For example, resistive memories (ReRAMs) store the data in the form of resistance of titanium oxides, and by injecting voltage into the word line and sensing the resultant current on the bit-line, we obtain the dot-product of the input voltages and cell conductances using Kirchhoff's law. Recent works have explored the design space of ReRAM based accelerators for machine learning algorithms by leveraging this dot-product functionality [2]. These ReRAM based accelerators exploit the massive parallelism and relaxed precision requirements, to provide orders of magnitude improvement when compared to current CPU/GPU architectures and custom ASICs, inspite of their high read/write latency. Despite the significant performance gain offered by computational NVMs, previous works have relied on manual mapping of workloads to the memory arrays, making it difficult to configure it for new workloads. We combat this problem by proposing a programmable inmemory processor architecture and programming framework. The architecture consists of memory arrays grouped in tiles, and a custom interconnect to facilitate communication between the arrays. Each array acts as unit of storage as well as processing element. The proposed in-memory processor architecture is simple. The key challenge is developing a programming framework and a rich ISA which can allow diverse data-parallel programs to leverage the underlying computational efficiency. The efficiency of the proposed in-memory processor comes from two sources. First, massive parallelism. NVMs are composed of several thousands of arrays. Each of these arrays are transformed into ALUs which can compute concurrently. Second, reduction in data movement, by avoiding shuffling of data between memory and processor cores. Our goal is to establish the programming semantics and execution models to expose the above benefits of ReRAM computing to general purpose data parallel programs. The proposed programming framework seeks to expose the underling parallelism in the hardware by merging the concepts of data-flow and vector processing (or SIMD). Data-flow explicitly exposes the Instruction Level Parallelism (ILP) in the programs, while vector processing exposes the Data Level Parallelism (DLP) in programs. Google's TensorFlow [1] is a popular programming model for machine learning. We observe that TensorFlow's programming semantics is a perfect marriage of data-flow and vector-processing. Thus, our proposed programming framework starts by requiring the programmers to write programs in TensorFlow. We develop a TensorFlow compiler that generates binary code for our in-memory data-flow processor. The TensorFlow (TF) programs are essentially Data Flow Graphs (DFG) where each operator node can have tensors as operands. A DFG which operates on one element of a vector is referred to as a module by the compiler. The compiler transforms input DFG into a collection data-parallel modules. Modules which operate on same vectors belong to an Instruction Block (IB), and are run concurrently on memory arrays. 
We combat this problem by proposing a programmable in-memory processor architecture and programming framework. The architecture consists of memory arrays grouped into tiles, with a custom interconnect to facilitate communication between the arrays. Each array acts as a unit of storage as well as a processing element. The proposed in-memory processor architecture is simple; the key challenge is developing a programming framework and a rich ISA that allow diverse data-parallel programs to leverage the underlying computational efficiency. The efficiency of the proposed in-memory processor comes from two sources. First, massive parallelism: NVMs are composed of several thousand arrays, each of which is transformed into an ALU that can compute concurrently. Second, reduction in data movement, achieved by avoiding the shuffling of data between memory and processor cores.

Our goal is to establish programming semantics and execution models that expose the above benefits of ReRAM computing to general-purpose data-parallel programs. The proposed programming framework seeks to expose the underlying parallelism in the hardware by merging the concepts of data-flow and vector processing (SIMD). Data-flow explicitly exposes the Instruction-Level Parallelism (ILP) in programs, while vector processing exposes the Data-Level Parallelism (DLP). Google's TensorFlow [1] is a popular programming model for machine learning, and we observe that its programming semantics is a perfect marriage of data-flow and vector processing. Thus, our proposed programming framework starts by requiring programmers to write programs in TensorFlow, and we develop a TensorFlow compiler that generates binary code for our in-memory data-flow processor. TensorFlow (TF) programs are essentially Data Flow Graphs (DFGs) in which each operator node can take tensors as operands. A DFG that operates on one element of a vector is referred to as a module by the compiler. The compiler transforms the input DFG into a collection of data-parallel modules; modules that operate on the same vectors belong to an Instruction Block (IB) and are run concurrently on the memory arrays.
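As a concrete example of the kind of input the framework expects, the following TensorFlow 1.x-style graph (the graph-mode API in use when this work appeared) builds a small element-wise vector computation. Conceptually, the per-element slice of this graph is what the compiler calls a module, and the identical modules over the same input vectors form one Instruction Block that can be dispatched concurrently to the memory arrays. The toy computation and names are illustrative, not taken from the paper.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph-mode API

# A tiny data-flow graph over 1024-element vectors: y = relu(a * b + c).
a = tf.placeholder(tf.float32, shape=[1024], name="a")
b = tf.placeholder(tf.float32, shape=[1024], name="b")
c = tf.placeholder(tf.float32, shape=[1024], name="c")

prod = tf.multiply(a, b, name="prod")  # one operator node of the DFG
acc = tf.add(prod, c, name="acc")      # consumes the 'prod' node's output
y = tf.nn.relu(acc, name="y")

# The scalar slice (mul -> add -> relu) is one "module"; the 1024 identical
# modules over a, b, c form an Instruction Block, run concurrently on arrays.
with tf.Session() as sess:
    feed = {a: np.ones(1024, np.float32),
            b: np.full(1024, 2.0, np.float32),
            c: np.full(1024, -1.0, np.float32)}
    print(sess.run(y, feed_dict=feed)[:4])  # [1. 1. 1. 1.]
```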
Our compiler explores several interesting optimizations, such as unrolling high-dimensional tensors, maximizing ILP within a module, pipelining memory reads and writes, and minimizing communication between arrays.

To create a programmable in-memory processor, we argue that a variety of computation primitives need to be implemented by exploiting the analog computation capability of the ReRAM arrays. Thus, we develop a general-purpose ISA and design a memory array architecture that can support diverse operations. For instance, we show how to efficiently implement complex operations (such as division, transcendental functions, and element-wise vector multiplication) using the analog primitives of the ReRAM memory arrays. Furthermore, we discuss efficient network/memory co-design for reduction operations and scatter/gather operations.
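The abstract does not detail how each complex operation is decomposed onto the analog primitives, but division illustrates the idea: with only multiply and add/subtract primitives available, a reciprocal can be refined iteratively and then multiplied by the numerator. The sketch below uses the standard Newton-Raphson reciprocal iteration purely as an illustration of such a decomposition under assumed operand ranges; it is not claimed to be the authors' implementation.

```python
import numpy as np

def reciprocal_newton(d, iterations=6):
    """Approximate 1/d element-wise using only multiplies and subtractions.

    Newton-Raphson iteration for f(x) = 1/x - d:
        x_{k+1} = x_k * (2 - d * x_k)
    The fixed initial guess converges for d roughly in (0, 2.2); a real
    implementation would first normalize d into a known range.
    """
    d = np.asarray(d, dtype=float)
    x = np.full_like(d, 0.9)       # crude initial guess
    for _ in range(iterations):
        x = x * (2.0 - d * x)      # only mul/sub: maps onto the array primitives
    return x

def divide(a, b, iterations=6):
    """Element-wise a / b built from multiply plus an iterative reciprocal."""
    return np.asarray(a, dtype=float) * reciprocal_newton(b, iterations)

print(divide([1.0, 3.0, 5.0], [2.0, 1.5, 1.25]))  # ~[0.5, 2.0, 4.0]
```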
In summary, in this poster we will present the programming framework, compiler, ISA, and architecture of our proposed general-purpose in-memory data-flow processor built out of resistive compute memories. Figure X shows our overall framework. We will also present experimental results across micro-benchmarks and real-world benchmarks from PARSEC and Rodinia. Initial results demonstrate ∼800x and ∼100x speedups compared to multi-core CPU and GPU execution, respectively.