Side-Channel Analysis of the Random Number Generator in STM32 MCUs
Kalle Ngo, E. Dubrova
DOI: 10.1145/3526241.3530324

The hardware random number generator (RNG) integrated in STM32 MCUs is intended to ensure that the numbers it generates cannot be guessed with a probability higher than that of a random guess. The RNG is based on several ring oscillators whose outputs are combined and post-processed to produce a 32-bit random number per round of computation. In this paper, we show that it is possible to train a neural network capable of recovering the Hamming weight of these random numbers from power traces with higher than 60% probability. This is a 4-fold improvement over the 14% probability of the most likely Hamming weight.
Benchmark Comparisons of Spike-based Reconfigurable Neuroprocessor Architectures for Control Applications
Adam Z. Foshie, Charles Rizzo, Hritom Das, Chaohui Zheng, J. Plank, G. Rose
DOI: 10.1145/3526241.3530381

Neuromorphic computing is a leading option for non-von Neumann computing architectures. With it, neural networks are developed that derive architectural inspiration from how the brain operates with neurons, synapses, and spikes. These networks are often implemented in either software- or hardware-based neuroprocessors designed to handle specific tasks efficiently. Even if implemented in hardware, software emulation is instrumental in determining the worthwhile features and capabilities of the architecture. In this work, two novel neuroprocessors are introduced: the software-based RISP neuroprocessor and the RAVENS hardware neuroprocessor. Several benchmark tests using control applications are performed with each neuroprocessor configured in various ways to evaluate their comparative performance and training properties.
An Effective Test Method for Block RAMs in Heterogeneous FPGAs Based on a Novel Partial Bitstream Relocation Technique
Wei-Xi Xiong, Yanze Li, Changpeng Sun, Huanlin Luo, Jiafeng Liu, Jian Wang, Jinmei Lai, G. Qu
DOI: 10.1145/3526241.3530317

Block RAMs (BRAMs) play an important role in modern heterogeneous FPGAs, so testing them comprehensively and efficiently is a major concern. On-chip Partial Bitstream Relocation (PBR) based on FPGA Dynamic Partial Reconfiguration (DPR) can reduce the time spent configuring modules in an FPGA while lowering the memory overhead of storing partial bitstreams for the reconfigurable modules. Previous PBR techniques are difficult to combine directly with BRAM testing because they are tedious, unsuitable for large-scale designs, or limited to specific devices. In addition, existing BRAM fault models are incomplete, and testing algorithms need improvement to achieve higher fault coverage. This paper proposes an effective BRAM test method based on a novel PBR technique. Our test method establishes a complete fault model for BRAM and improves the testing algorithms for faults in BRAM ECC circuits and intra-word coupling faults in SRAM cells. On-board experiments are carried out on a Xilinx xc7vx690t device, where 14 BRAM configurations are used to fully test the BRAMs. In conjunction with the proposed PBR technique, the number of configurations can be reduced to 10, which leads to a 35.7% time saving.
{"title":"Session details: Session 3B: VLSI for Machine Learning and Artifical Intelligence 1","authors":"J. Hu","doi":"10.1145/3542687","DOIUrl":"https://doi.org/10.1145/3542687","url":null,"abstract":"","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133527153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Evaluation of In-Exact Compressor based Approximate Multipliers
C. PrashanthH., R. SoujanyaS., Bindu G. Gowda, M. Rao
DOI: 10.1145/3526241.3530320

VLSI implementations of arithmetic functions are in high demand, given the rise in hardware realizations of image and digital signal processing modules for various autonomous applications. Hardware implementation offers faster results and the desired outcome, but achieving the same design metrics in terms of power, footprint, and delay on tiny decision-making edge devices with limited resources requires design improvisation. Approximate computing promises to meet the required hardware metrics in error-resilient applications, where the inexact output does not deviate much from the expected one and the decision made remains unchanged. Multiplier blocks are heavily used in multimedia functional chips, and introducing approximation into these blocks benefits the design metrics and chip cost of the developed system-on-chip (SoC). The proposed work designs and uses approximate AND-OR re-coded compressors of various sizes in the multiple reduction stages, along with various fast adders in the final addition stage of the multiplier design. Further, the design metrics and resources utilized by the different multiplier designs were characterized in ASIC and FPGA synthesis flows, respectively, along with their error statistics. The designed approximate multipliers were employed in a Gaussian smoothing application to evaluate the quality-hardware resource trade-off of approximation.
MI2D: Accelerating Matrix Inversion with 2-Dimensional Tile Manipulations
Lingfeng Chen, Tian Xia, Wenzhe Zhao, Pengju Ren
DOI: 10.1145/3526241.3530314

Matrix inversion is critical in mathematics and scientific applications. Large-scale dense matrix inversion is especially challenging for modern computers due to its heavy dependencies among matrix elements and its poor temporal data locality. In this paper, we propose a novel accelerator termed MI2D, which converts matrix inversion into regular matrix multiplications using 2-dimensional cross-tile operations and novel algorithms for efficient data reuse and computation. Our evaluations show that MI2D can be easily integrated with the existing matrix engines in modern high-end CPUs and NPUs, and it effectively accelerates matrix inversion, with a 2.7× speedup over an Intel Skylake CPU and 24× over an NVIDIA RTX 2080 Ti.
Advanced Environment Modeling and Interaction in an Open Source RISC-V Virtual Prototype
Pascal Pieper, V. Herdt, R. Drechsler
DOI: 10.1145/3526241.3530374

RISC-V is a modern Instruction Set Architecture (ISA) whose open nature, in combination with a clean and modular design, gives it enormous potential to become a game changer in the Internet of Things (IoT) era. Recently, SystemC-based Virtual Prototypes (VPs) have been introduced into the RISC-V ecosystem to lay the foundation for advanced, industry-proven system-level use cases. However, VP-driven environment modeling and interaction have been mostly neglected in the RISC-V context. In this paper, we propose such an extension to broaden the application domain of virtual prototyping in the RISC-V context. As a foundation, we build upon the open-source RISC-V VP available on GitHub. To visualize the environment, we designed a Graphical User Interface (GUI) together with libraries that expose hardware communication interfaces such as GPIO and SPI from the VP to an interactive environment model. Our approach is designed to integrate with SystemC-based VPs that leverage a Transaction Level Modeling (TLM) communication system, favoring a speed-optimized simulation. To show the practicality of the environment model, we provide a set of building blocks such as buttons, LEDs, and an OLED display, and configure them in two demonstration environments. Our evaluation with three different case studies demonstrates that our approach builds virtual environments effectively and correctly matches the real physical systems. To advance the RISC-V community and stimulate further research, we also provide our extended VP platform, the environment configuration and visualization toolbox, and both case studies as open source on GitHub.
Thermal and Power-Aware Run-time Performance Management of 3D MPSoCs with Integrated Flow Cell Arrays
Halima Najibi, A. Levisse, G. Ansaloni, Marina Zapater, David Atienza Alonso
DOI: 10.1145/3526241.3530309

Flow Cell Array (FCA) technology employs microchannels filled with an electrolytic fluid to concurrently provide cooling and power generation to integrated circuits (ICs). This solution is particularly appealing for Three-Dimensional Multi-Processor Systems-on-Chip (3D MPSoCs) realized in deeply scaled technologies, as their extreme power densities result in significant thermal and voltage-supply challenges. FCAs provide them with extra power to boost performance. However, the dual effects of FCAs (cooling and power supply) have conflicting trends, leading to a complex interplay between temperature, voltage stability, and performance. In this paper, we explore this trade-off by introducing a novel methodology that controls the operating frequency of computing components and the electrolytic coolant flow rate at run-time. Our strategy enables tangible performance gains while abiding by timing, voltage-drop, and temperature constraints. We showcase its benefits on a 4-layer 3D MPSoC, achieving up to a 24% increase in operating frequencies and application speedups of up to 17%, while reducing the energy cost of FCA liquid pumping.
Leveraging Machine Learning for Gate-level Timing Estimation Using Current Source Models and Effective Capacitance
Dimitrios Garyfallou, Anastasis Vagenas, Charalampos Antoniadis, Y. Massoud, G. Stamoulis
DOI: 10.1145/3526241.3530343

With process technology scaling, accurate gate-level timing analysis becomes even more challenging. Highly resistive on-chip interconnects have an ever-increasing impact on timing, signals no longer resemble smooth saturated ramps, and gate-interconnect interdependencies are stronger. Moreover, efficiency is a serious concern, since repeatedly invoking a signoff tool during incremental optimization of modern VLSI circuits has become a major bottleneck. In this paper, we introduce a novel machine learning approach for timing estimation of gate-level stages using current source models and the concept of multiple slew and effective capacitance values. First, we exploit a fast iterative algorithm for initial stage timing estimation and feature extraction, and then we employ four artificial neural networks to correlate the initial delay and slew estimates for both the driver and the interconnect with golden SPICE results. Contrary to prior works, our method uses fewer and more accurate features to represent the stage, leading to more efficient models. Experimental evaluation on driver-interconnect stages implemented in a 7 nm FinFET technology shows that our method achieves 0.99% (0.90 ps) and 2.54% (2.59 ps) mean error against SPICE for stage delay and slew, respectively. Furthermore, it has a small memory footprint (1.27 MB) and runs 35× faster than a commercial signoff tool. Thus, it can be integrated into timing-driven optimization steps to provide signoff accuracy and expedite timing closure.
MnM: A Fast and Efficient Min/Max Searching in MRAM
Amitesh Sridharan, Fan Zhang, Deliang Fan
DOI: 10.1145/3526241.3530349

In-Memory Computing (IMC) technology is considered a promising approach to the well-known memory-wall challenge of data-intensive applications. In this paper, we are the first to propose MnM, a novel IMC system with innovative architecture/circuit designs for fast and efficient Min/Max search in emerging Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM). Our proposed SOT-MRAM-based in-memory logic circuits are specially optimized to perform the parallel, one-cycle XNOR logic that is heavily used in the Min/Max searching-in-memory algorithm. Our novel in-memory XNOR circuit also has an overhead of just two transistors per row, compared to most prior methodologies, which typically use multiple sense amplifiers or complex CMOS logic gates. We also design all the other peripheral circuits required to implement complete Min/Max searching-in-MRAM computation. Our comprehensive cross-layer experiments on Dijkstra's algorithm and other sorting algorithms on real-world datasets show that MnM achieves significant performance improvements over CPUs, GPUs, and other competing IMC platforms based on RRAM/MRAM/DRAM.