Pub Date: 2017-12-27 | DOI: 10.1109/TMSCS.2017.2787109
Yu Bai;Deliang Fan;Mingjie Lin
We propose an innovative stochastic-based computing architecture to implement a low-power and robust artificial neural network (S-ANN) with both magnetic tunneling junction (MTJ) and Domain Wall (DW) devices. Our mixed-model HSPICE simulation results have shown that, for a well-known pattern recognition task, a 34-neuron S-ANN implementation achieves more than 1.5 orders of magnitude lower energy consumption and 2.5 orders of magnitude less hidden-layer chip area when compared with its deterministic-based ANN counterparts implemented with digital and analog CMOS circuits. We believe that our S-ANN architecture achieves such a remarkable performance gain by leveraging two key ideas. First, because all neural signals are encoded as random bit streams, the standard weighted-sum synapses can be accomplished by a stochastic bit writing and reading procedure. Second, we designed and implemented a novel multiple-phase pumping circuit structure to effectively realize the soft-limiting neural transfer function that is essential to improving the overall ANN capability and reducing its network complexity.
{"title":"Stochastic-Based Synapse and Soft-Limiting Neuron with Spintronic Devices for Low Power and Robust Artificial Neural Networks","authors":"Yu Bai;Deliang Fan;Mingjie Lin","doi":"10.1109/TMSCS.2017.2787109","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2787109","url":null,"abstract":"We propose an innovative stochastic-based computing architecture to implement low-power and robust artificial neural network (S-ANN) with both magnetic tunneling junction (MTJ) and Domain Wall (DW) devices. Our mixed-model HSPICE simulation results have shown that, for a well-known pattern recognition task, a 34-neuron S-ANN implementation achieves more than 1.5 orders of magnitude lower energy consumption and 2.5 orders of magnitude less hidden layer chip area, when compared with its deterministicbased ANN counterparts which are implemented with digital and analog CMOS circuits. We believe that our S-ANN architecture achieves such a remarkable performance gain by leveraging two key ideas. First, because all neural signals are encoded as random bit streams, the standard weighted-sum synapses can be accomplished by stochastic bit writing and reading procedure. Second, we designed and implemented a novel multiple-phase pumping circuit structure to effectively realize the soft-limiting neural transfer function that is essential to improve the overall ANN capability and reduce its network complexity.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"463-476"},"PeriodicalIF":0.0,"publicationDate":"2017-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2787109","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-11-22 | DOI: 10.1109/TMSCS.2017.2748122
Andrew D. Brown;John E. Chad;Raihaan Kamarudin;Kier J. Dugan;Stephen B. Furber
SpiNNaker (Spiking Neural Network Architecture) is a specialized computing engine intended for real-time simulation of neural systems. It consists of a mesh of 240x240 nodes, each containing 18 ARM9 processors: over a million cores communicating via a bespoke network. Ultimately, the machine will support the simulation of up to a billion neurons in real time, allowing simulation experiments to be taken to hitherto unattainable scales. The architecture achieves this by ignoring three of the axioms of computer design: the communication fabric is non-deterministic; there is no global core synchronisation; and the system state, held in distributed memory, is not coherent. Time models itself: there is no notion of computed simulation time, because wallclock time is simulation time. Whilst these design decisions run counter to conventional wisdom, they bring the engine's behavior closer to its intended simulation target: neural systems. We describe how SpiNNaker simulates large neural ensembles, provide performance figures, and outline some failure mechanisms. SpiNNaker simulation time scales 1:1 with wallclock time at least up to nine million synaptic connections on a 768-core subsystem (about 1/1400th of the full system), accurately producing logically predicted results.
{"title":"SpiNNaker: Event-Based Simulation—Quantitative Behavior","authors":"Andrew D. Brown;John E. Chad;Raihaan Kamarudin;Kier J. Dugan;Stephen B. Furber","doi":"10.1109/TMSCS.2017.2748122","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2748122","url":null,"abstract":"SpiNNaker (Spiking Neural Network Architecture) is a specialized computing engine, intended for real-time simulation of neural systems. It consists of a mesh of 240x240 nodes, each containing 18 ARM9 processors: over a million cores, communicating via a bespoke network. Ultimately, the machine will support the simulation of up to a billion neurons in real time, allowing simulation experiments to be taken to hitherto unattainable scales. The architecture achieves this by ignoring three of the axioms of computer design: the communication fabric is non-deterministic; there is no global core synchronisation, and the system state-held in distributed memory-is not coherent. Time models itself: there is no notion of computed simulation time-wallclock time is simulation time. Whilst these design decisions are orthogonal to conventional wisdom, they bring the engine behavior closer to its intended simulation target-neural systems. We describe how SpiNNaker simulates large neural ensembles; we provide performance figures and outline some failure mechanisms. SpiNNaker simulation time scales 1:1 with wallclock time at least up to nine million synaptic connections on a 768 core subsystem (~1400th of the full system) to accurately produce logically predicted results.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"450-462"},"PeriodicalIF":0.0,"publicationDate":"2017-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2748122","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67861115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-11-17 | DOI: 10.1109/TMSCS.2017.2773488
S. Karen Khatamifard;Ismail Akturk;Ulya R. Karpuzcu
Each synchronization point represents a point of serialization and can thereby easily hurt parallel scalability. As demonstrated by recent studies, approximating, i.e., relaxing, synchronization by eliminating a subset of synchronization points spatio-temporally can help improve parallel scalability, as long as the approximation-incurred violations of basic execution semantics remain predictable and controllable. Even if the divergence from fully-synchronized execution results in lower computation accuracy rather than catastrophic program termination, for approximation to be viable, the accuracy loss must be bounded. In this paper, we assess the viability of approximate synchronization using Speculative Lock Elision (SLE), which has been adopted by hardware transactional memory implementations from industry, as a baseline for comparison. Specifically, we investigate the efficacy of exploiting semantic and temporal characteristics of critical sections in preventing excessive loss in computation accuracy, and devise a lightweight, proof-of-concept Approximate Speculative Lock Elision (ASLE) implementation, which exploits existing hardware support for SLE.
{"title":"On Approximate Speculative Lock Elision","authors":"S. Karen Khatamifard;Ismail Akturk;Ulya R. Karpuzcu","doi":"10.1109/TMSCS.2017.2773488","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2773488","url":null,"abstract":"Each synchronization point represents a point of serialization, and thereby can easily hurt parallel scalability. As demonstrated by recent studies, approximating, i.e., relaxing synchronization by eliminating a subset of synchronization points spatio-temporally can help improve parallel scalability, as long as approximation incurred violations of basic execution semantics remain predictable and controllable. Even if the divergence from fully-synchronized execution renders lower computation accuracy ratherthan catastrophic program termination, for approximation to be viable, the accuracy loss must be bounded. In this paper, we assess the viability of approximate synchronization using Speculative Lock Elision (SLE), which was adopted by hardware transactional memory implementations from industry, as a baseline for comparison. Specifically, we investigate the efficacy of exploiting semantic and temporal characteristics of critical sections in preventing excessive loss in computation accuracy, and devise a light-weight, proof-of-concept Approximate Speculative Lock Elision (ASLE) implementation, which exploits existing hardware support for SLE.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 2","pages":"141-151"},"PeriodicalIF":0.0,"publicationDate":"2017-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2773488","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68025087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-11-16 | DOI: 10.1109/TMSCS.2017.2774294
Małgorzata Michalska;Simone Casale-Brunet;Endri Bezati;Marco Mattavelli
The implementation and optimization of dynamic dataflow programs on multi/many-core platforms require solving a very difficult problem: how to partition and schedule the processing elements and dimension their interconnecting buffers according to given optimization functions in terms of throughput, memory usage, and energy consumption. This problem is NP-hard even for two cores. Thus, finding a close-to-optimal solution consists of exploring the design space with appropriate heuristics that identify the design points maximizing or minimizing the desired (multiple) objective functions subject to a set of constraints. In general, exploring the design space efficiently is a challenging task due to the massive number of admissible design points. Efficient estimation methodologies are necessary to support an effective search of the design space by reducing to a minimum the cost and the number of measurements on the physical platform. This paper presents a new methodology that provides high-precision estimations of the performance of dynamic dataflow programs on multi/many-core platforms for any set of design configurations. The estimations rely on post-processing the execution trace obtained from a single profiled execution of the program. Furthermore, the paper describes the estimation methodology, the implementation tools, and the type of information that is obtained from many/multi-core dataflow executions and used to drive the optimization heuristics. The results confirm the high level of accuracy achieved on different types of platforms and the effectiveness of the illustrated design space exploration methodology.
{"title":"High-Precision Performance Estimation for the Design Space Exploration of Dynamic Dataflow Programs","authors":"Małgorzata Michalska;Simone Casale-Brunet;Endri Bezati;Marco Mattavelli","doi":"10.1109/TMSCS.2017.2774294","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2774294","url":null,"abstract":"The implementation and optimization of dynamic dataflow programs on multi/many-core platforms require solving a very difficult problem: how to partition and schedule the processing elements and dimension their interconnecting buffers according to given optimization functions in terms of throughput, memory usage, and energy consumption. This problem is NP-hard even for two cores. Thus, finding a close-to-optimal solution consists of exploring the design space by appropriate heuristics identifying those design points that maximize or minimize the desired (multiple) objective functions subject to a set of constraints. In general, exploring the design space efficiently is a challenging task due to the massive number of admissible design points. Efficient estimation methodologies are necessary to support an effective search of the design space by reducing to a minimum the cost and the number of measurements on the physical platform. This paper presents a new methodology that provides high-precision estimations of dynamic dataflow programs performances on multi/many-core platforms for any set of design configurations. The estimations rely on the execution trace post-processing obtained by a single profiled execution of the program. Furthermore, the paper describes the estimation methodology, implementation tools, and the type of information that is obtained from many/multi-core dataflow executions and used to drive the optimization heuristics. The results confirm a high level of accuracy achieved on different types of platforms and the effectiveness of the illustrated design space exploration methodology.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 2","pages":"127-140"},"PeriodicalIF":0.0,"publicationDate":"2017-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2774294","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68025090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-11-14 | DOI: 10.1109/TMSCS.2017.2773523
Christopher H. Bennett;Jean-Etienne Lorival;Francois Marc;Théo Cabaret;Bruno Jousselme;Vincent Derycke;Jacques-Olivier Klein;Cristell Maneux
Organic memristors are promising molecular electronic devices for neuro-inspired on-chip learning applications. In this paper, we present a numerically efficient compact model suitable for $Fe(bpy)_3^{2+}$