
Latest Publications in IEEE Computer Architecture Letters

Enhancing DNN Training Efficiency Via Dynamic Asymmetric Architecture
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-03-12 · DOI: 10.1109/LCA.2023.3275909
Samer Kurzum;Gil Shomron;Freddy Gabbay;Uri Weiser
Deep neural networks (DNNs) require abundant multiply-and-accumulate (MAC) operations. Thanks to DNNs’ ability to accommodate noise, some of the computational burden is commonly mitigated by quantization–that is, by using lower precision floating-point operations. Layer granularity is the preferred method, as it is easily mapped to commodity hardware. In this paper, we propose Dynamic Asymmetric Architecture (DAA), in which the micro-architecture decides what the precision of each MAC operation should be during runtime. We demonstrate a DAA with two data streams and a value-based controller that decides which data stream deserves the higher precision resource. We evaluate this mechanism in terms of accuracy on a number of convolutional neural networks (CNNs) and demonstrate its feasibility on top of a systolic array. Our experimental analysis shows that DAA potentially achieves 2x throughput improvement for ResNet-18 while saving 35% of the energy with less than 0.5% degradation in accuracy.
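To make the idea concrete, here is a minimal software model of a value-based precision controller in the spirit of DAA. The function names, bit width, and quantizer below are our own illustrative assumptions, not the paper's micro-architecture: per MAC pair, the larger-magnitude operand keeps the full-precision unit and the other is routed to a low-precision one.

```python
def quantize(v, bits, max_abs=1.0):
    """Crude stand-in for a low-precision datapath: clamp to [-max_abs, max_abs]
    and round onto a 2**bits-level grid."""
    step = 2 * max_abs / (2 ** bits)
    v = max(-max_abs, min(max_abs, v))
    return round(v / step) * step

def daa_mac(stream_a, stream_b, weights_a, weights_b, low_bits=8):
    """Value-based controller over two operand streams: for each MAC pair, the
    operand with the larger magnitude gets the full-precision multiply, the
    other the low-precision one, and both products are accumulated."""
    acc = 0.0
    for a, b, wa, wb in zip(stream_a, stream_b, weights_a, weights_b):
        if abs(a) >= abs(b):
            acc += a * wa + quantize(b, low_bits) * wb
        else:
            acc += quantize(a, low_bits) * wa + b * wb
    return acc

print(daa_mac([0.9, 0.05], [0.1, 0.7], [0.5, 0.5], [0.5, 0.5]))
```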
Citations: 0
Hardware-Implemented Lightweight Accelerator for Large Integer Polynomial Multiplication
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-03-10 · DOI: 10.1109/LCA.2023.3274931
Pengzhou He;Yazheng Tu;Çetin Kaya Koç;Jiafeng Xie
Large integer polynomial multiplication is frequently used as a key component in post-quantum cryptography (PQC) algorithms. Following the trend that efficient hardware implementation for PQC is emphasized, in this letter, we propose a new hardware-implemented lightweight accelerator for the large integer polynomial multiplication of Saber (one of the National Institute of Standards and Technology third-round finalists). First, we provided a derivation process to obtain the algorithm for the targeted polynomial multiplication. Then, the proposed algorithm is mapped into an optimized hardware accelerator. Finally, we demonstrated the efficiency of the proposed design, e.g., this accelerator with $v=32$ has at least 48.37% less area-delay product (ADP) than the existing designs. The outcome of this work is expected to provide useful references for efficient implementation of other PQC.
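As a point of reference for what the accelerator computes, the sketch below is a plain schoolbook model of multiplication in Saber's polynomial ring Z_q[x]/(x^n + 1) with n = 256 and q = 2^13. It is a functional reference only, not the paper's derived algorithm or its hardware mapping.

```python
def poly_mul_negacyclic(a, b, n=256, q=1 << 13):
    """Schoolbook multiplication in Z_q[x]/(x^n + 1), the ring Saber works in;
    a software reference model, not the accelerator's datapath."""
    assert len(a) == len(b) == n
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                # x^n wraps around to -1, so the overflowing term is subtracted
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c

# tiny sanity check in a reduced ring Z_q[x]/(x^4 + 1)
print(poly_mul_negacyclic([1, 2, 3, 4], [5, 6, 7, 8], n=4))
```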
Citations: 0
In-Memory Versioning (IMV)
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-03-05 · DOI: 10.1109/LCA.2023.3273124
David Andrew Roberts;Haojie Ye;Tony Brewer;Sean Eilert
In this letter, we propose and evaluate designs for a novel hardware-assisted data versioning system (in-memory versioning or IMV) in the context of high-performance computing. Our main novelty and advantage over recent published work is that it does not require any changes to host processor logic, instead augmenting a memory controller within memory modules. It is faster and more efficient than existing high-performance computing (HPC) checkpointing schemes and works from hours to sub-second checkpoint intervals. The main premise is to perform most operations in hardware at cache-line granularity, avoiding operating system (OS) latency and page copying bandwidth overhead. Energy is saved by keeping data movement in the memory module, compared with page granularity cross channel or cross-network copying that is currently used. For a 1-second checkpoint commit interval, we demonstrate up to 20x checkpoint performance and 70x energy savings using IMV versus page copy-on-write (COW).
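A functional sketch of cache-line-granularity copy-on-write versioning inside a memory module is shown below. The class and method names are ours, and the real IMV mechanism lives in the module's memory controller rather than in software; this only illustrates the undo-log behavior per checkpoint epoch.

```python
LINE = 64  # bytes per cache line

class InMemoryVersioningModel:
    """Sketch of cache-line copy-on-write versioning: the first write to a line
    since the last checkpoint saves the old copy, so a rollback can restore it."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.undo = {}                    # line base address -> pre-checkpoint copy

    def write(self, addr, data: bytes):
        line = addr - addr % LINE
        if line not in self.undo:         # first write to this line in the epoch
            self.undo[line] = bytes(self.mem[line:line + LINE])
        self.mem[addr:addr + len(data)] = data

    def commit_checkpoint(self):
        self.undo.clear()                 # start a new epoch; old versions dropped

    def rollback(self):
        for line, old in self.undo.items():
            self.mem[line:line + LINE] = old
        self.undo.clear()

m = InMemoryVersioningModel(4096)
m.write(100, b"checkpointed")
m.commit_checkpoint()
m.write(100, b"speculative!")
m.rollback()
print(bytes(m.mem[100:112]))              # b'checkpointed'
```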
Citations: 0
Energy-Efficient Bayesian Inference Using Bitstream Computing
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-02-14 · DOI: 10.1109/LCA.2023.3238584
Soroosh Khoram;Kyle Daruwalla;Mikko Lipasti
Uncertainty quantification is critical to many machine learning applications especially in mobile and edge computing tasks like self-driving cars, robots, and mobile devices. Bayesian Neural Networks can be used to provide these uncertainty quantifications but they come at extra computation costs. However, power and energy can be limited at the edge. In this work, we propose using stochastic bitstream computing substrates for deploying BNNs which can significantly reduce power and costs. We design our Bayesian Bitstream Processor hardware for an audio classification task as a test case and show that it can outperform a micro-controller baseline in energy by two orders of magnitude and delay by an order of magnitude, at lower power.
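The underlying bitstream (stochastic computing) primitive can be demonstrated in a few lines. This is a generic illustration of why the arithmetic is so cheap in hardware (a multiply becomes a bitwise AND of two streams), not the Bayesian Bitstream Processor design itself.

```python
import random

def to_bitstream(p, length=1024, rng=random.Random(0)):
    """Encode a probability p in [0, 1] as a stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def from_bitstream(bits):
    return sum(bits) / len(bits)

# Multiplying two (independent) stochastic streams reduces to a bitwise AND,
# the kind of single-gate arithmetic a bitstream accelerator is built from.
a = to_bitstream(0.8)
b = to_bitstream(0.5)
prod = [x & y for x, y in zip(a, b)]
print(from_bitstream(prod))   # ~0.4, i.e. 0.8 * 0.5 plus stochastic error
```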
Citations: 0
Intelligent SSD Firmware for Zero-Overhead Journaling
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-02-09 · DOI: 10.1109/LCA.2023.3243695
Hanyeoreum Bae;Donghyun Gouk;Seungjun Lee;Jiseon Kim;Sungjoon Koh;Jie Zhang;Myoungsoo Jung
We propose Check0-SSD, an intelligent SSD firmware to offer the best system-level fault-tolerance without performance degradation and lifetime shortening. Specifically, the SSD firmware autonomously removes transaction checkpointing, which eliminates redundant writes to the flash backend. To this end, Check0-SSD dynamically classifies journal descriptor/commit requests at runtime and switches the address spaces between journal and data regions by examining the host's filesystem layout and journal region information in a self-governing manner. Our evaluations demonstrate that Check0-SSD can protect both data and metadata with 89% enhanced storage lifetime while exhibiting similar or even better performance compared to the norecovery SSD.
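One plausible way to picture the runtime classification step is sketched below: the firmware checks whether a write lands in the filesystem's journal region and whether the block begins with the jbd2 magic number. The region bounds, tags, and policy here are our illustrative assumptions, not the paper's firmware; only the jbd2 magic constant is a real on-disk value.

```python
JBD2_MAGIC = 0xC03B3998   # magic number at the head of ext4/jbd2 journal blocks

def classify_write(lba, payload, journal_start, journal_end):
    """Tag an incoming write so journal-region traffic can be handled specially
    (e.g., skipping the redundant checkpoint copy to the flash backend)."""
    if not (journal_start <= lba < journal_end):
        return "data"                      # ordinary data write, handled normally
    head = int.from_bytes(payload[:4], "big")
    return "journal-meta" if head == JBD2_MAGIC else "journal-data"

print(classify_write(5000, (0xC03B3998).to_bytes(4, "big") + b"\x00" * 508,
                     journal_start=4096, journal_end=8192))   # -> journal-meta
```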
Citations: 0
Last-Level Cache Insertion and Promotion Policy in the Presence of Aggressive Prefetching
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-02-03 · DOI: 10.1109/LCA.2023.3242178
Daniel A. Jiménez;Elvira Teran;Paul V. Gratz
The last-level cache (LLC) is the last chance for memory accesses from the processor to avoid the costly latency of going to main memory. LLC management has been the topic of intense research focusing on two main techniques: replacement and prefetching. However, these two ideas are often evaluated separately, with one being studied outside the context of the state-of-the-art in the other. We find that high-performance replacement and highly accurate pattern-based prefetching do not result in synergistic improvements in performance. The overhead of complex replacement policies is wasted in the presence of aggressive prefetchers. We find that a simple replacement policy with minimal overhead provides at least the same benefit as a state-of-the-art replacement policy in the presence of aggressive pattern-based prefetching. Our proposal is based on the idea of using a genetic algorithm to search the space of insertion and promotion policies that generalize transitions in the recency stack for the least-recently-used policy.
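The search idea can be illustrated with a toy recency-stack cache-set model and a minimal genetic algorithm over (insertion position, promotion position) pairs. The synthetic trace, fitness function, and GA parameters below are our own assumptions, not the paper's evaluation setup.

```python
import random

rng = random.Random(1)
WAYS = 16

def run_policy(policy, trace):
    """One cache set as a recency stack (index 0 = MRU). `policy` gives the stack
    position for insertions on a miss and promotions on a hit; returns hit rate."""
    insert_pos, promote_to = policy
    stack, hits = [], 0
    for addr in trace:
        if addr in stack:
            hits += 1
            stack.remove(addr)
            stack.insert(min(promote_to, len(stack)), addr)   # generalized promotion
        else:
            if len(stack) >= WAYS:
                stack.pop()                                    # evict the LRU position
            stack.insert(min(insert_pos, len(stack)), addr)    # generalized insertion
    return hits / len(trace)

def genetic_search(trace, pop=20, gens=30):
    """Minimal GA over insertion/promotion positions -- illustrative only."""
    population = [(rng.randrange(WAYS), rng.randrange(WAYS)) for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=lambda p: run_policy(p, trace), reverse=True)
        parents = scored[: pop // 4]
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])                               # one-point crossover
            if rng.random() < 0.2:                             # mutation
                child = (rng.randrange(WAYS), child[1])
            children.append(child)
        population = parents + children
    return max(population, key=lambda p: run_policy(p, trace))

trace = []
for i in range(2000):
    # synthetic mix: a small hot working set plus a streaming scan that a good
    # insertion policy should keep from polluting the set
    trace.append(rng.randrange(8) if rng.random() < 0.6 else 200 + i % 400)
print(genetic_search(trace))
```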
Citations: 2
ADT: Aggressive Demotion and Promotion for Tiered Memory
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-01-13 · DOI: 10.1109/LCA.2023.3236685
Yaebin Moon;Wanju Doh;Kwanhee Kyung;Eojin Lee;Jung Ho Ahn
Tiered memory using DRAM as upper-tier (fast memory) and emerging slower-but-larger byte-addressable memory as lower-tier (slow memory) is a promising approach to expanding main-memory capacity. Based on the observation that there are many cold pages in data-center applications, proactive demotion schemes demote cold pages to slow memory even when free space in fast memory is not deficient. Prior works on proactive demotion lower the requirement of expensive fast-memory capacity by reducing applications’ resident set size in fast memory. Also, some of the prior works mitigate the massive performance drop due to insufficient fast-memory capacity when there is a spike in demand for hot data. However, there is room for further improvement to save a larger fast-memory capacity with further aggressive demotion, which can fully reap the aforementioned advantages of proactive demotion. In this paper, we propose a new proactive demotion scheme, ADT, which performs aggressive demotion and promotion for tiered memory. Using the memory access locality within the unit in which applications and memory allocators allocate memory, ADT extends the unit of demotion/promotion from the page adopted by prior works to make its demotion more aggressive. By performing demotion and promotion by the extended unit, ADT reduces 29% of fast-memory usage with only a 2.3% performance drop. Also, it achieves 2.28× speedup compared to the default Linux kernel when the system's memory usage is larger than fast-memory capacity, which outperforms state-of-the-art schemes for tiered memory management.
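The granularity change can be sketched as follows: instead of ranking individual 4 KiB pages, per-page heat is aggregated over a larger unit and whole cold units are demoted. The 2 MiB unit size and the threshold are placeholders; in ADT the unit is whatever the application or allocator actually allocated, and the real mechanism lives in the kernel.

```python
from collections import defaultdict

PAGE = 4096
UNIT = 2 * 1024 * 1024   # placeholder demotion unit standing in for an allocation unit

def pick_demotion_victims(page_access_counts, threshold=0):
    """Aggregate per-page access counts over units and return the base addresses
    of units cold enough to demote wholesale (a sketch, not the paper's kernel code)."""
    unit_heat = defaultdict(int)
    for page_addr, count in page_access_counts.items():
        unit_heat[page_addr // UNIT] += count
    return [u * UNIT for u, heat in unit_heat.items() if heat <= threshold]

# Two units: the first contains one warm page, the second is entirely cold.
counts = {0 * PAGE: 5, 3 * PAGE: 0, (UNIT // PAGE) * PAGE: 0}
print(pick_demotion_victims(counts))   # only the cold unit's base address is returned
```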
Citations: 0
HAMMER: Hardware-Friendly Approximate Computing for Self-Attention With Mean-Redistribution And Linearization
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-01-04 · DOI: 10.1109/LCA.2022.3233832
Seonho Lee;Ranggi Hwang;Jongse Park;Minsoo Rhu
The recent advancement of natural language processing (NLP) models is the result of ever-increasing model sizes and datasets. Most modern NLP models adopt the Transformer-based model architecture, whose main bottleneck is the self-attention mechanism. As the computation required for self-attention grows rapidly with model size, self-attention has become the main challenge in deploying NLP models. Several prior works have sought to address this bottleneck, but most of them suffer from significant design overheads and additional training requirements. In this work, we propose HAMMER, a hardware-friendly approximate computing solution for self-attention employing mean-redistribution and linearization, which effectively increases the performance of the self-attention mechanism with low overheads. Compared to previous state-of-the-art self-attention accelerators, HAMMER improves performance by 1.2–1.6× and energy efficiency by 1.2–1.5×.
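For context on where the bottleneck sits, below is standard scaled-dot-product self-attention in NumPy. This is the exact baseline formulation that such accelerators target, not the paper's mean-redistribution or linearization approximation.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Plain single-head scaled-dot-product self-attention."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # O(n^2 * d): the bottleneck term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

n, d = 128, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
out = self_attention(X, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)   # (128, 64)
```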
Citations: 0
Advancing Compilation of DNNs for FPGAs Using Operation Set Architectures
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2022-12-13 · DOI: 10.1109/LCA.2022.3227643
Burkhard Ringlein;Francois Abel;Dionysios Diamantopoulos;Beat Weiss;Christoph Hagleitner;Dietmar Fey
The slow-down of technology scaling combined with the exponential growth of modern machine learning and artificial intelligence models has created a demand for specialized accelerators, such as GPUs, ASICs, and field-programmable gate arrays (FPGAs). FPGAs can be reconfigured and have the potential to outperform other accelerators, while also being more energy-efficient, but are cumbersome to use with today's fractured landscape of tool flows. We propose the concept of an operation set architecture to overcome the current incompatibilities and hurdles in using DNN-to-FPGA compilers by combining existing specialized frameworks into one organic compiler that also allows the efficient and automatic re-use of existing community tools. Furthermore, we demonstrate that mixing different existing frameworks can increase the efficiency by more than an order of magnitude.
Citations: 1
CoreNap: Energy Efficient Core Allocation for Latency-Critical Workloads
IF 2.3 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2022-12-08 · DOI: 10.1109/LCA.2022.3227629
Gyeongseo Park;Ki-Dong Kang;Minho Kim;Daehoon Kim
In data-center servers, the dynamic core allocation for Latency-Critical (LC) applications can play a crucial role in improving energy efficiency under Service Level Objective (SLO) constraints, allowing cores to enter idle states (i.e., C-states) that consume less power by turning off a part of hardware components of a processor. However, prior studies focus on the core allocation for application threads while not considering cores involved in network packet processing, even though packet processing affects not only response latency but also energy consumption considerably. In this paper, we first investigate the impacts of the explicit core allocation for network packet processing on the tail response latency and energy consumption while running LC applications. We observe that co-adjusting the number of cores for network packet processing along with the number of cores for LC application threads can improve energy efficiency substantially, compared with adjusting the number of cores only for application threads, as prior studies do. In addition, we propose a dynamic core allocation, called CoreNap, which allocates/de-allocates cores for both LC application threads and packet processing. CoreNap measures the CPU-utilization by application threads and packet processing individually, and predicts response latency and power consumption when the combination of core allocation is enforced via a lightweight prediction model. Based on the prediction, CoreNap chooses/enforces the energy-efficient combination of core allocation. Our experimental results show that CoreNap reduces energy consumption by up to 18.6% compared with state-of-the-art study that adjusts cores only for LC application in parallel packet processing environments.
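The per-interval decision can be sketched as a small search: enumerate splits of cores between application threads and packet processing, keep the splits whose predicted tail latency meets the SLO, and pick the lowest-power survivor. The `predict` callback and the toy model below stand in for the paper's lightweight prediction model; both are our own placeholders.

```python
def choose_core_split(predict, total_cores, slo_us):
    """Enumerate (app cores, net cores) splits; among those whose predicted tail
    latency meets the SLO, return the one with the lowest predicted power."""
    best = None
    for app in range(1, total_cores):
        for net in range(1, total_cores - app + 1):
            latency_us, watts = predict(app, net)
            if latency_us <= slo_us and (best is None or watts < best[0]):
                best = (watts, app, net)
    return best   # (power, app_cores, net_cores), or None if no split meets the SLO

# Made-up model: latency shrinks with more cores of each kind, power grows linearly.
model = lambda app, net: (1000 / app + 800 / net, 5.0 * (app + net))
print(choose_core_split(model, total_cores=16, slo_us=500))
```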
Citations: 0