
Latest Publications in IEEE Computer Architecture Letters

DynaFlow: An ML Framework for Dynamic Dataflow Selection in SpGEMM Accelerators
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-15 · DOI: 10.1109/LCA.2025.3570667
Sanjali Yadav;Bahar Asgari
Sparse matrix-matrix multiplication (SpGEMM) is a critical operation in numerous fields, including scientific computing, graph analytics, and deep learning, leveraging matrix sparsity to reduce both storage and computation costs. However, the irregular structure of sparse matrices poses significant challenges for performance optimization. Existing hardware accelerators often employ fixed dataflows designed for specific sparsity patterns, leading to performance degradation when the input deviates from these assumptions. As SpGEMM adoption expands across a broad spectrum of sparsity workloads, the demand grows for accelerators capable of dynamically adapting their dataflow schemes to diverse sparsity patterns. To address this, we propose DynaFlow, a machine learning-based framework that trains on the set of dataflows supported by any given accelerator and learns to predict the optimal dataflow based on the input sparsity pattern. By leveraging decision trees and deep reinforcement learning, DynaFlow surpasses static dataflow selection approaches, achieving up to a 50× speedup.
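At its core, the approach learns a mapping from cheap, per-input sparsity statistics to a dataflow choice before the kernel is launched. The sketch below illustrates that mapping with a decision tree; the feature set, dataflow labels, and training data are hypothetical stand-ins, and DynaFlow's deep-reinforcement-learning component is not shown.

```python
# Minimal sketch: a decision tree that maps simple sparsity features of an
# SpGEMM input pair to one of several candidate dataflows. Feature choices
# and dataflow labels are illustrative, not DynaFlow's actual design.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

DATAFLOWS = ["inner-product", "outer-product", "row-wise"]  # hypothetical label set

def sparsity_features(A, B):
    """Cheap per-input features: density and row-occupancy statistics."""
    def stats(M):
        nnz_per_row = (M != 0).sum(axis=1)
        return [
            (M != 0).mean(),        # overall density
            nnz_per_row.mean(),     # average row occupancy
            nnz_per_row.std(),      # irregularity of the sparsity pattern
        ]
    return np.array(stats(A) + stats(B))

# Offline: train on inputs whose best dataflow was measured on the accelerator.
rng = np.random.default_rng(0)
train_X = rng.random((300, 6))                  # stand-in for measured features
train_y = rng.integers(0, len(DATAFLOWS), 300)  # stand-in for measured best dataflow
model = DecisionTreeClassifier(max_depth=5).fit(train_X, train_y)

# Online: pick a dataflow for a new input pair before launching the kernel.
A = (rng.random((64, 64)) < 0.05).astype(np.float32)
B = (rng.random((64, 64)) < 0.20).astype(np.float32)
choice = DATAFLOWS[model.predict(sparsity_features(A, B).reshape(1, -1))[0]]
print("selected dataflow:", choice)
```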
Citations: 0
Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-14 · DOI: 10.1109/LCA.2025.3570235
Seoyoung Ko;Hyunjeong Shim;Wanju Doh;Sungmin Yun;Jinin So;Yongsuk Kwon;Sang-Soo Park;Si-Dong Roh;Minyong Yoon;Taeksang Song;Jung Ho Ahn
Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting proper contexts extracted from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity/latency limits, whereas specialized hardware or RDMA clusters lack flexibility or incur network overhead. We present Cosmos, integrating general-purpose cores within CXL memory devices for full ANNS offload and introducing rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72× higher throughput than the baseline CXL system and 2.35× over a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.
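One way to read the adjacency-aware placement is that clusters with nearby centroids tend to be probed by the same query, so spreading near neighbors across CXL devices balances the per-query load. The sketch below is a greedy heuristic written under that assumption; the neighbor window, device count, and centroid data are illustrative, and it is not Cosmos's actual algorithm.

```python
# Illustrative greedy placement: penalize devices that already hold clusters
# close to the one being placed, then pick the least-loaded device.
import numpy as np

def place_clusters(centroids, num_devices):
    n = len(centroids)
    d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    order = np.argsort(d.sum(axis=1))        # start from the most "central" clusters
    placement = -np.ones(n, dtype=int)
    load = np.zeros(num_devices)
    for c in order:
        neighbors = np.argsort(d[c])[1:4]    # the 3 nearest clusters (skip self)
        penalty = np.zeros(num_devices)
        for nb in neighbors:
            if placement[nb] >= 0:
                penalty[placement[nb]] += 1.0
        dev = int(np.argmin(load + penalty)) # balance load and avoid co-locating neighbors
        placement[c] = dev
        load[dev] += 1.0
    return placement

rng = np.random.default_rng(1)
centroids = rng.random((32, 128)).astype(np.float32)  # 32 toy IVF clusters, 128-dim vectors
print(place_clusters(centroids, num_devices=4))
```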
Citations: 0
Minimal Counters, Maximum Insight: Simplifying System Performance With HPC Clusters for Optimized Monitoring
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-14 · DOI: 10.1109/LCA.2025.3570157
Shubhi Shukla;Abhijeet Singh;Rajdeep Chakraborty;Anirban Chakraborty;Tejas Rathod;Harshal Mumbaikar;Manoj Kumar Munigala;Madhusudhan K N;Pabitra Mitra;Debdeep Mukhopadhyay
As computer systems become more complex, evaluating performance requires tracking various hardware performance counters that capture the system’s internal activities. While these counters provide valuable insights, their growing number makes it challenging to identify the most relevant ones for performance analysis. In this paper, we investigate the correlation between performance counter values and overall system performance, while also exploring the inter-correlation between different counters. Our findings demonstrate that specific counters are strongly correlated with key performance metrics and that significant redundancy exists among counters. By leveraging these relationships, we propose a method for selecting a small, representative set of performance counters. This streamlined set can further be used to accurately predict performance scores across various workloads and system configurations.
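The selection step can be pictured as a relevance-versus-redundancy trade-off over the two correlation structures the letter measures. The sketch below is one assumed formulation of that trade-off, run on synthetic counter data; the paper's exact selection criterion may differ.

```python
# Greedy counter selection: favor counters that correlate strongly with the
# performance score but weakly with counters already chosen.
import numpy as np

def select_counters(counter_matrix, perf_score, k):
    """counter_matrix: samples x counters; perf_score: one value per sample."""
    n_counters = counter_matrix.shape[1]
    corr_perf = np.array([abs(np.corrcoef(counter_matrix[:, i], perf_score)[0, 1])
                          for i in range(n_counters)])
    corr_cc = np.abs(np.corrcoef(counter_matrix, rowvar=False))
    selected = [int(np.argmax(corr_perf))]          # most performance-relevant counter
    while len(selected) < k:
        redundancy = corr_cc[:, selected].max(axis=1)
        score = corr_perf - redundancy              # relevance minus redundancy
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected

rng = np.random.default_rng(2)
counters = rng.random((200, 40))                    # 200 workload samples, 40 counters
perf = counters[:, 3] * 2 + counters[:, 17] + rng.normal(0, 0.05, 200)
print(select_counters(counters, perf, k=5))
```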
Citations: 0
SDT: Cutting Datacenter Tax Through Simultaneous Data-Delivery Threads
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-11 · DOI: 10.1109/LCA.2025.3549423
Amin Mamandipoor;Huy Dinh Tran;Mohammad Alian
Networking is considered a datacenter tax, and hyperscalers push hard to provide high-performance networking with minimal resource expenditure. To keep up with the ever-increasing network rates, many CPU cycles are spent on the networking tax. We make a key observation that network processing threads can be simultaneously executed on server CPUs with minimal interference with the application threads. However, utilizing simultaneous multithreading (SMT) to scale the number of network threads with the number of application threads suffers from (1) failing to provide strict tail latency requirements for latency-critical applications, and (2) reducing the number of available hardware threads for application processes, thus contributing to a high datacenter network tax. In this work, we design, implement, and evaluate a chip-multiprocessor (CMP) with specialized Simultaneous Data-delivery Threads (SDT) per physical core. The key insight is that with judicious partitioning at the architectural level, SDT can safely co-run with application processes with guaranteed performance isolation. Our evaluation results, using full-system simulation, show that a 20-core CMP enhanced with SDT reduces the area and power consumption of a baseline 40-core CMP by 47.5% and 66%, respectively, while reducing network throughput by less than 10%.
Citations: 0
OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-07 · DOI: 10.1109/LCA.2025.3567844
Minseok Seo;Jungi Hyun;Seongho Jeong;Xuan Truong Nguyen;Hyuk-Jae Lee;Hyokeun Lee
The key-value (KV) cache in large language models (LLMs) now necessitates a substantial amount of memory capacity as its size proportionally grows with the context’s size. Recently, Compute-Express Link (CXL) memory has become a promising method to secure memory capacity. However, CXL memory in a GPU-based LLM inference platform entails performance and scalability challenges due to the limited bandwidth of CXL memory. This paper proposes OASIS, an outlier-aware KV cache clustering scheme for scaling LLM inference in CXL memory systems. Our method is based on the observation that clustering is effective in trading off between performance and accuracy compared to previous quantization- or selection-based approaches if clustering is aware of outliers. Our evaluation shows OASIS yields a 3.6× speedup compared to the case without clustering while preserving accuracy with just 5% of the full KV cache.
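A rough picture of the mechanism: cluster the cached vectors, keep entries that sit far from their centroid in exact form, and let the rest be represented by their centroid. The sketch below assumes k-means over key vectors and a distance quantile as the outlier test; the cluster count, threshold, and storage layout are illustrative, not taken from the paper.

```python
# Outlier-aware compression of a toy KV cache: inliers are replaced by their
# cluster centroid, outliers are kept exactly.
import numpy as np
from sklearn.cluster import KMeans

def compress_kv(keys, n_clusters=64, outlier_quantile=0.95):
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(keys)
    assign = km.labels_
    dist = np.linalg.norm(keys - km.cluster_centers_[assign], axis=1)
    threshold = np.quantile(dist, outlier_quantile)
    outlier_mask = dist > threshold
    # Approximate representation: centroid index for inliers, raw vector for outliers.
    return km.cluster_centers_, assign, keys[outlier_mask], outlier_mask

rng = np.random.default_rng(3)
keys = rng.normal(size=(4096, 128)).astype(np.float32)  # toy cache of 4096 key vectors
centers, assign, outliers, mask = compress_kv(keys)
print("kept exactly:", mask.sum(), "of", len(keys))
```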
Citations: 0
Accelerating Vector Permutation Instruction Execution via Controllable Bitonic Network
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-05 · DOI: 10.1109/LCA.2025.3548527
Shabirahmed Badashasab Jigalur;Daniel Jiménez Mazure;Teresa Cervero Garcia;Yen-Cheng Kuan
High-performance computing applications rely heavily on vector instructions to accelerate data processing. In this letter, we propose a controllable bitonic network (CBN) and use it as a lane interconnect to efficiently rearrange data across vector lanes of a vector processing unit to accelerate the execution of vector permutation instructions (VPIs). Our work focuses on the RISC-V vector instruction set because of its configurable vector length support. Through simulations with vector-permutation-intensive applications of a RISC-V vector benchmark suite (RiVEC), the proposed approach with an eight-lane 64-bit CBN demonstrates an average speedup of ≥6× regarding the VPI execution time over a conventional ring-network-based approach. In addition, to verify our approach on hardware, we implemented a processor system with an eight-lane 16-bit CBN on an AMD A7-100T FPGA operating at 20 MHz, demonstrating single-cycle execution of the RISC-V vr.gather and vr.scatter instructions.
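Any lane permutation can be realized on a bitonic network by sorting elements on their destination lane index, and the per-stage compare-exchange decisions then play the role of the network's control bits. The sketch below is a software model of that routing on eight lanes with an arbitrary example permutation; it is not the letter's RTL design.

```python
# Software model of routing an arbitrary lane permutation through a bitonic
# network: each compare-exchange stage swaps a pair iff the destination-index
# ordering requires it, and the recorded swap decisions act as control bits.
def bitonic_route(values, dest):
    """Route values so that the element with destination d ends up in lane d."""
    lanes = list(zip(dest, values))
    n = len(lanes)                     # must be a power of two
    controls = []
    k = 2
    while k <= n:                      # merge stages of the iterative bitonic sort
        j = k // 2
        while j > 0:
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    swap = (lanes[i][0] > lanes[partner][0]) == ascending
                    if swap:
                        lanes[i], lanes[partner] = lanes[partner], lanes[i]
                    stage.append(int(swap))
            controls.append(stage)     # one control-bit vector per stage
            j //= 2
        k *= 2
    return [v for _, v in lanes], controls

# Example: a gather-style permutation across 8 lanes.
values = [10, 11, 12, 13, 14, 15, 16, 17]
dest   = [3, 0, 7, 1, 5, 2, 6, 4]      # element i must land in lane dest[i]
routed, controls = bitonic_route(values, dest)
print(routed)                          # lane d now holds the value whose dest was d
```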
Citations: 0
Data-Pattern-Driven LUT for Efficient In-Cache Computing in CNNs Acceleration
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-05 · DOI: 10.1109/LCA.2025.3548080
Zhengpan Fei;Mingchuan Lyu;Satoshi Kawakami;Koji Inoue
The lookup table (LUT)-based Processing-in-Memory (PIM) solutions perform computations by looking up precomputed results stored in LUTs, providing exceptional efficiency for complex operations such as multiplication, making them highly suitable for energy- and latency-efficient Convolutional Neural Network (CNN) inference tasks. However, including all possible results in the LUT naively demands exponential hardware resources, significantly limiting parallelism and increasing hardware area, latency, and power overhead. While decomposition and compression techniques can reduce the LUT size, they also introduce considerable memory access overhead and additional operations. To address these challenges, we conduct an extensive analysis to identify which data portions significantly impact accuracy in CNNs. Based on the insight that key data is concentrated in a small range, we propose a data-pattern-driven (DPD) optimization strategy, which approximates less critical data to drastically reduce LUT size while preserving computational efficiency with acceptable accuracy loss.
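The data-pattern idea can be pictured as keeping LUT rows only for the operand values that dominate the distribution and snapping rare values to the nearest covered entry. The sketch below does this for a toy 8-bit weight-times-activation table; the coverage threshold, table sizes, and weight distribution are illustrative assumptions rather than the paper's design.

```python
# Build multiplication LUT rows only for the weight values that cover ~95% of
# the distribution; rare weights are approximated by the nearest kept value.
import numpy as np

def build_lut(weights, acts_bits=8, coverage=0.95):
    vals, counts = np.unique(weights, return_counts=True)
    order = np.argsort(counts)[::-1]
    keep, total = [], 0
    for idx in order:                    # keep just enough values to reach the coverage target
        keep.append(vals[idx])
        total += counts[idx]
        if total / len(weights) >= coverage:
            break
    keep = np.sort(np.array(keep, dtype=np.int32))
    lut = {int(w): np.array([int(w) * a for a in range(2 ** acts_bits)], dtype=np.int32)
           for w in keep}
    return keep, lut

def lut_multiply(w, a, keep, lut):
    w_approx = int(keep[np.abs(keep - w).argmin()])  # snap rare weights to the nearest entry
    return lut[w_approx][a]

rng = np.random.default_rng(4)
weights = np.clip(rng.normal(0, 8, 10000), -127, 127).astype(np.int8)  # concentrated near 0
keep, lut = build_lut(weights)
print("LUT rows kept:", len(keep), "of", len(np.unique(weights)))
print("exact 37*5 =", 37 * 5, " LUT approx =", lut_multiply(37, 5, keep, lut))
```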
Citations: 0
DPWatch: A Framework for Hardware-Based Differential Privacy Guarantees
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-04 · DOI: 10.1109/LCA.2025.3547262
Pawan Kumar Sanjaya;Christina Giannoula;Ian Colbert;Ihab Amer;Mehdi Saeedi;Gabor Sines;Nandita Vijaykumar
Differential privacy (DP) and federated learning (FL) have emerged as important privacy-preserving approaches when using sensitive data to train machine learning models. FL ensures that raw sensitive data does not leave the users’ devices by training the model in a distributed manner. DP ensures that the model does not leak any information about an individual by clipping and adding noise to the gradients. However, real-life deployments of such algorithms assume that the third-party application implementing DP-based FL is trusted, and is thus given access to sensitive data on the data owner’s device/server. In this work, we propose DPWatch, a hardware-based framework for ML accelerators that enforces guarantees that a third party application cannot leak sensitive user data used for training and ensures that the gradients are appropriately noised before leaving the device. We evaluate DPWatch on two accelerators and demonstrate small area and performance overheads.
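The gradient transformation the framework enforces is the standard differentially private recipe of per-example clipping followed by calibrated Gaussian noise. The sketch below shows that recipe in plain NumPy with illustrative hyperparameters; DPWatch's hardware partitioning and isolation mechanisms are not modeled.

```python
# Standard DP-SGD style gradient release: clip each example's gradient to a
# fixed norm, sum, add Gaussian noise, then average.
import numpy as np

def clip_and_noise(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))  # per-example clipping
    summed = np.sum(clipped, axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / len(per_example_grads)              # noisy averaged gradient

grads = [np.random.normal(size=128) for _ in range(32)]           # toy per-example gradients
print(np.linalg.norm(clip_and_noise(grads)))
```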
Citations: 0
Amethyst: Reducing Data Center Emissions With Dynamic Autotuning and VM Management
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-02 · DOI: 10.1109/LCA.2025.3566553
Mattia Tibaldi;Christian Pilato
To reduce the growing carbon emissions of cloud computing, we propose Amethyst, a new VM placement and migration strategy capable of adapting consumption to the currently available green energy. Amethyst tackles the problem on three fronts: it adjusts consumption to match energy production, optimizes execution on FPGA accelerators, and balances execution among servers. We evaluate the strategy with real workloads. Our simulations on CloudSim Plus show that Amethyst effectively reduces the carbon emissions of cloud computing and increases energy efficiency compared to the state of the art.
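As a rough illustration of adapting placement to available green energy, the sketch below packs VMs onto the servers with the most remaining green power and flags anything that does not fit for migration or slow-down when the forecast changes. The policy, power numbers, and server names are made-up assumptions, not Amethyst's actual strategy.

```python
# Greedy green-energy-aware placement: largest VMs first, always onto the
# server with the most remaining green power budget.
def place_vms(vm_power, green_budget):
    """vm_power: dict vm -> watts; green_budget: dict server -> available green watts."""
    remaining = dict(green_budget)
    placement, deferred = {}, []
    for vm, p in sorted(vm_power.items(), key=lambda kv: -kv[1]):
        server = max(remaining, key=remaining.get)
        if remaining[server] >= p:
            placement[vm] = server
            remaining[server] -= p
        else:
            deferred.append(vm)   # candidate for slow-down, migration, or brown energy
    return placement, deferred

vms = {"vm0": 120, "vm1": 80, "vm2": 60, "vm3": 200, "vm4": 40}   # watts per VM
budget = {"s0": 250, "s1": 180, "s2": 90}                          # green watts per server
placement, deferred = place_vms(vms, budget)
print(placement, deferred)
```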
Citations: 0
Fold-PIM: A Cost-Efficient LPDDR5-Based PIM for On-Device SLMs
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-03-02 · DOI: 10.1109/LCA.2025.3566692
Kyoungho Jeun;Hyeonu Kim;Eojin Lee
The increasing demand for on-device AI applications has shifted focus to Small Language Models (SLMs) optimized for mobile environments. However, the limited memory bandwidth of LPDDR5-based systems presents significant challenges for efficiently executing memory-bound matrix-vector multiplication operations, a core component of SLM inference. In this paper, we propose Fold-PIM, an LPDDR5-based Processing-in-Memory (PIM) architecture designed to address these challenges. Fold-PIM features a shared PU architecture that leverages subarray-level parallelism and employs in-tile transposition, adaptive tiling, and a tailored protocol to reduce vector replacement latency. Our evaluation results demonstrate that Fold-PIM achieves up to a 3.9× speedup in token generation time for SLM inference compared to the baseline system without PIM.
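The workload being targeted is memory-bound matrix-vector multiplication split into tiles whose partial sums are produced near their subarrays and reduced at the end. The sketch below is a plain software model of that tiled GEMV pattern with illustrative tile sizes; in-tile transposition, the shared-PU scheduling, and the LPDDR5 protocol changes are not modeled.

```python
# Tiled GEMV: each tile's partial dot products stand in for the work one
# subarray-level unit would perform; partial sums are accumulated at the end.
import numpy as np

def tiled_gemv(W, x, tile_rows=64, tile_cols=128):
    out = np.zeros(W.shape[0], dtype=W.dtype)
    for r in range(0, W.shape[0], tile_rows):
        for c in range(0, W.shape[1], tile_cols):
            tile = W[r:r + tile_rows, c:c + tile_cols]            # one tile of the matrix
            out[r:r + tile_rows] += tile @ x[c:c + tile_cols]     # per-tile partial sums
    return out

rng = np.random.default_rng(5)
W = rng.normal(size=(512, 1024)).astype(np.float32)
x = rng.normal(size=1024).astype(np.float32)
assert np.allclose(tiled_gemv(W, x), W @ x, rtol=1e-3, atol=1e-3)
```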
Citations: 0