In-memory computing (IMC) has been proposed to overcome the von Neumann bottleneck in data-intensive applications. However, existing IMC solutions cannot achieve both high parallelism and high flexibility, which limits their use in more general scenarios: as a highly parallel IMC design, a MAC crossbar is limited to matrix-vector multiplication, while logic-in-memory (LiM), another IMC method, is more flexible in supporting different logic functions but has low parallelism. To improve LiM parallelism, we investigate how the single-instruction, multiple-data (SIMD) model of conventional CPUs could expand the number of LiM operands processed in one cycle. The biggest challenge is the inefficiency of handling non-contiguous data in parallel, due to SIMD's (i) contiguous-address requirement, (ii) limited cache bandwidth, and (iii) large full-resolution parallel-computing overheads. This article presents GRAPHIC, the first reported in-memory SIMD architecture that solves the parallelism and irregular-data-access challenges of applying SIMD to LiM. GRAPHIC exploits content-addressable memory (CAM) and row-wise-accessible SRAM. By providing in-situ, fully parallel, low-overhead address search and cache read-compute-and-update operations, GRAPHIC accomplishes high-efficiency gather and aggregation with high parallelism, high energy efficiency, low latency, and low area overhead. Experiments on both contiguous-access and irregular-data-pattern applications show an average speedup of 5× over an iso-area AVX-like LiM, and 3-5× over the emerging CAM-based accelerators CAPE and GaaS-X in advanced technology nodes.
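The gather step that GRAPHIC accelerates in hardware can be illustrated in software. The sketch below (names and data are illustrative, not from the paper) emulates a CAM's parallel content search with a dictionary lookup, then gathers rows at the non-contiguous addresses the search returns:

```python
# Illustrative sketch, not GRAPHIC's design: a hardware CAM compares a
# query key against every stored row in parallel; here that one-shot
# search is modeled as a dict from row content to row address.

def build_cam(keys):
    """Map each key (row content) to its row address, as a CAM would."""
    return {k: addr for addr, k in enumerate(keys)}

def gather(cam, rows, query_keys):
    """Fetch the rows whose keys match the queries, in query order."""
    return [rows[cam[k]] for k in query_keys if k in cam]

keys = [0x1A, 0x2B, 0x3C, 0x4D]
rows = ["alpha", "beta", "gamma", "delta"]
cam = build_cam(keys)
print(gather(cam, rows, [0x3C, 0x1A]))  # matches sit at non-contiguous addresses
```

A conventional SIMD load would require these operands to sit at contiguous addresses; the content search sidesteps that constraint.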
"GRAPHIC: Gather and Process Harmoniously in the Cache With High Parallelism and Flexibility"
Yiming Chen;Mingyen Lee;Guohao Dai;Mufeng Zhou;Nagadastagiri Challapalle;Tianyi Wang;Yao Yu;Yongpan Liu;Yu Wang;Huazhong Yang;Vijaykrishnan Narayanan;Xueqing Li
Pub Date: 2023-07-17. DOI: 10.1109/TETC.2023.3290683. IEEE Transactions on Emerging Topics in Computing, vol. 12, no. 1, pp. 84-96.
Pub Date: 2023-07-13. DOI: 10.1109/TETC.2023.3293477
Xiaozhou Lu;Sunghwan Kim
As maintaining a properly balanced GC content is crucial for minimizing errors in DNA storage, constructing GC-balanced DNA codes has become an important research topic. In this article, we propose a novel code construction method based on the weight distribution of the data, which enables us to construct GC-balanced DNA codes. Additionally, we introduce a specific encoding process for both balanced and imbalanced data parts. One of the key differences between the proposed codes and existing codes is that the parity lengths of the proposed codes vary with the data parts, while the parity lengths of existing codes remain fixed. To evaluate the effectiveness of the proposed codes, we compare their average parity lengths to those of existing codes. Our results demonstrate that the proposed codes have significantly shorter average parity lengths for DNA sequences with appropriate GC contents.
"New Construction of Balanced Codes Based on Weights of Data for DNA Storage." IEEE Transactions on Emerging Topics in Computing, vol. 11, no. 4, pp. 973-984.
Pub Date: 2023-07-12. DOI: 10.1109/TETC.2023.3293426
Khakim Akhunov;Kasım Sinan Yıldırım
There is an emerging requirement to perform data-intensive parallel computations, e.g., machine-learning inference, locally on batteryless sensors. These devices are resource-constrained and operate intermittently due to irregular energy availability in the environment. Intermittent execution can lead to several side effects that prevent the correct execution of computational tasks. Even though recent studies have proposed methods to cope with these side effects and execute such tasks correctly, they overlooked the efficient intermittent execution of parallelizable, data-intensive machine-learning tasks. In this article, we present PiMCo, a novel programmable CRAM-based in-memory coprocessor that exploits the Processing In-Memory (PIM) paradigm and facilitates the power-failure-resilient execution of parallelizable computational loads. Contrary to existing PIM solutions for intermittent computing, PiMCo offers better programmability to accelerate a variety of parallelizable tasks. Our performance evaluation demonstrates that PiMCo improves the performance of existing low-power accelerators for intermittent computing by up to 8× and energy efficiency by up to 150×.
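The core hazard described here, losing progress on power failure, can be sketched in software. The following checkpointing toy is not PiMCo's hardware mechanism; a file stands in for non-volatile memory, and all names are illustrative. State is committed after each step so execution resumes instead of restarting:

```python
# Hedged sketch of intermittent execution: persist (index, partial sum)
# to "non-volatile" storage after every step, and resume from the last
# committed checkpoint if one exists.
import json
import os
import tempfile

def run_with_checkpoints(data, ckpt_path):
    """Sum a list, checkpointing progress so a power failure mid-run
    costs at most one step of rework."""
    i, acc = 0, 0
    if os.path.exists(ckpt_path):          # resuming after a failure
        with open(ckpt_path) as f:
            i, acc = json.load(f)
    while i < len(data):
        acc += data[i]
        i += 1
        with open(ckpt_path, "w") as f:    # commit before proceeding
            json.dump([i, acc], f)
    return acc

with tempfile.TemporaryDirectory() as d:
    print(run_with_checkpoints([1, 2, 3, 4], os.path.join(d, "ckpt.json")))  # 10
```

The side effects the abstract alludes to (e.g., partially applied updates becoming visible after a reboot) are exactly what such commit-before-proceed protocols guard against.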
"CRAM-Based Acceleration for Intermittent Computing of Parallelizable Tasks." IEEE Transactions on Emerging Topics in Computing, vol. 12, no. 1, pp. 48-59.
Pub Date: 2023-07-12. DOI: 10.1109/TETC.2023.3293140
Purab Ranjan Sutradhar;Sathwika Bavikadi;Sai Manoj Pudukotai Dinakarrao;Mark A. Indovina;Amlan Ganguly
Memory-centric computing systems have demonstrated superior performance and efficiency in memory-intensive applications compared to state-of-the-art CPUs and GPUs. 3-D stacked DRAM architectures unlock higher I/O data bandwidth than traditional 2-D memory architectures and are therefore better suited for incorporating memory-centric processors. However, merely integrating high-precision ALUs in the 3-D stacked memory does not ensure an optimized design, since such a design achieves only limited utilization of the memory chip's internal bandwidth and limited operational parallelization. To address this, we propose 3DL-PIM, a 3-D stacked memory-based Processing in Memory (PIM) architecture that places numerous Look-up Table (LUT)-based low-footprint Processing Elements (PEs) within the memory banks to achieve high parallel computing performance by maximizing data-bandwidth utilization. Instead of relying on traditional logic-based ALUs, the PEs are formed by clustering groups of programmable LUTs and can therefore be programmed on-the-fly to perform various logic/arithmetic operations. Our simulations show that 3DL-PIM achieves up to 2.6× higher processing performance and up to 2.65× higher area efficiency compared to a state-of-the-art 3-D stacked memory-based accelerator.
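The LUT-based PE idea can be illustrated in a few lines. The sketch below is a software analogy, not 3DL-PIM's interface: a table-backed element is "reprogrammed" into different logic operations simply by swapping truth tables, rather than hard-wiring an ALU operation:

```python
# Illustrative sketch of a look-up-table processing element: the
# operation is data (a 4-entry truth table), so the same element can be
# reprogrammed on-the-fly.

def make_lut(truth_table):
    """Return a 2-input PE backed by a 4-entry truth table."""
    def pe(a, b):
        return truth_table[(a << 1) | b]  # inputs index the table
    return pe

xor_pe = make_lut([0, 1, 1, 0])  # program the PE as XOR
and_pe = make_lut([0, 0, 0, 1])  # reprogram the same fabric as AND
print(xor_pe(1, 0), and_pe(1, 1))  # 1 1
```

Wider operations are built by chaining such table lookups, which is why LUT clusters can stand in for logic-based ALUs at a much smaller footprint per element.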
"3DL-PIM: A Look-Up Table Oriented Programmable Processing in Memory Architecture Based on the 3-D Stacked Memory for Data-Intensive Applications." IEEE Transactions on Emerging Topics in Computing, vol. 12, no. 1, pp. 60-72.