
Latest Articles in IEEE Embedded Systems Letters

Optimizing Systolic Array-Based NTT Accelerators
IF 2 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-04-22 | DOI: 10.1109/LES.2025.3562707
Saleh Mulhem;Eike Schultz;Lukas Groth;Mladen Berekovic;Rainer Buchty
Lattice-based post-quantum cryptography and homomorphic encryption schemes have become key methodologies for today’s and the future’s secure world. This comes at the cost of a vastly increased computational load due to the multiplication of polynomials with wide integer coefficients. NIST recommends the number theoretic transform (NTT) as an efficient remedy. Nevertheless, the NTT itself requires acceleration for large numbers of coefficients. This letter explores the use of systolic arrays as NTT accelerators and finds an optimal hardware architecture configuration across problem sizes. Design-space exploration yields a new configuration for an efficient 2-D NTT accelerator that retains the ability to execute other workloads. Our findings indicate that, in 22-nm technology, an optimal systolic array accelerator requires an area of 53.04 mm². The accelerator can efficiently apply the NTT to a polynomial with 4096 32-bit integer coefficients in 3296 cycles and 1794.92 nJ.
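For readers unfamiliar with the transform being accelerated, a minimal software NTT conveys what each systolic cell must compute. This is an illustrative sketch, unrelated to the letter’s hardware; the NTT-friendly prime 998244353 and its primitive root are assumptions chosen for convenience:

```python
# Minimal radix-2 Cooley-Tukey NTT sketch (illustrative only).
# MOD is an assumed NTT-friendly prime: MOD - 1 is divisible by 2^23.
MOD = 998244353
ROOT = 3  # primitive root of MOD

def ntt(a, invert=False):
    """Iterative NTT over Z_MOD; len(a) must be a power of two."""
    a = list(a)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages: the multiply-add a systolic cell would pipeline.
    length = 2
    while length <= n:
        w = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w = pow(w, MOD - 2, MOD)  # modular inverse of the twiddle
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * wn % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                wn = wn * w % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a

def poly_mul(p, q):
    """Polynomial product via forward NTT, pointwise multiply, inverse NTT."""
    n = 1
    while n < len(p) + len(q) - 1:
        n <<= 1
    fp = ntt(p + [0] * (n - len(p)))
    fq = ntt(q + [0] * (n - len(q)))
    return ntt([x * y % MOD for x, y in zip(fp, fq)], invert=True)
```

Multiplying (1 + 2x + 3x²) by (4 + 5x + 6x²) this way reproduces the schoolbook convolution 4, 13, 28, 27, 18 in O(n log n) modular operations instead of O(n²).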
Citations: 0
An Interchip Communication Method Suitable for Neuromorphic Chips by Detecting Data Stability
IF 2 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-04-21 | DOI: 10.1109/LES.2025.3562862
Jiaxu Cong;Jingyu Wang;Xiqin Tang;Bin Tong;Delong Shang
The sparsity of spike generation in neuromorphic computing establishes asynchronous communication as inherently more suited than serializer/deserializer (SerDes) for handling sparse interchip transmissions. However, traditional asynchronous methods often require more wires than the data bit width due to data encoding schemes. This letter introduces a novel method, detecting data stable receiver-transmitter (D2SRT), for interchip communication in neuromorphic chips. By detecting data stability, D2SRT achieves low power and high performance, with a single-bit energy consumption of 13.9 pJ and throughput of 217.2 Mb/s, surpassing traditional methods. Experimental results show that D2SRT meets the bandwidth requirements for most spiking neural networks (SNNs) and achieves exceptionally low dynamic power consumption.
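The letter’s circuit is not reproduced here, but its core idea — accept a bus value only once it has been observed stable, rather than encoding transitions on extra wires — can be modeled behaviorally. The sampling window below is an assumption for illustration:

```python
def stable_receive(samples, hold=3):
    """Toy behavioral model of data-stability detection: return the first
    bus value observed unchanged for `hold` consecutive samples, else None.
    (Illustrative only; D2SRT itself is an asynchronous circuit.)"""
    if not samples:
        return None
    run, last = 1, samples[0]
    for s in samples[1:]:
        run = run + 1 if s == last else 1
        last = s
        if run >= hold:
            return s
    return None
```

A glitching bus such as `[0b1010, 0b1011, 0b1011, 0b1011]` settles to `0b1011` after three matching samples, while a bus that never settles yields `None`.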
Citations: 0
IEEE Embedded Systems Letters Publication Information
IF 1.7 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-04-17 | DOI: 10.1109/LES.2025.3544886
Citations: 0
TinyTNAS: Time-Bound, GPU-Independent Hardware-Aware Neural Architecture Search for TinyML Time-Series Classification
IF 2 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-04-17 | DOI: 10.1109/LES.2025.3561870
Bidyut Saha;Riya Samanta;Ram Babu Roy;Soumya K. Ghosh
We present tiny time-series neural architecture search (TinyTNAS), a hardware-aware neural architecture search (NAS) framework optimized for efficient execution on CPUs, eliminating the need for costly GPUs. Traditional NAS methods often depend on reinforcement learning or evolutionary algorithms, requiring significant GPU resources and search time, which may be inaccessible to many machine learning researchers and practitioners. TinyTNAS addresses these limitations with an intelligent grid search approach that drastically reduces search time from hours to minutes, operating seamlessly on CPUs. It enables scalable model generation tailored for resource-constrained devices, optimizing neural networks within stringent constraints on RAM, Flash, and MAC operations. TinyTNAS also supports time-bound searches, ensuring rapid and efficient architecture discovery. Experiments on benchmark datasets, including UCIHAR, PAMAP2, WISDM, MIT-BIH, and PTB-ECG, demonstrate its ability to achieve state-of-the-art accuracy while significantly reducing resource usage and latency compared to expert-designed architectures. Furthermore, it surpasses GPU-dependent hardware-aware NAS methods based on reinforcement learning and evolutionary algorithms by drastically reducing search time. The code is publicly available at https://github.com/BidyutSaha/TinyTNAS.git.
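The time-bound, constraint-aware grid search at the heart of such a flow can be sketched in a few lines. The hyperparameter names and the constraint predicate below are hypothetical, not TinyTNAS’s actual search space:

```python
import itertools
import time

def time_bound_grid_search(evaluate, grid, budget_s=60.0, constraint=None):
    """Scan a hyperparameter grid on the CPU, skipping configurations that
    violate hardware limits and stopping when the time budget expires."""
    best_cfg, best_score = None, float("-inf")
    deadline = time.monotonic() + budget_s
    for values in itertools.product(*grid.values()):
        if time.monotonic() > deadline:
            break  # time-bound: return the best model found so far
        cfg = dict(zip(grid.keys(), values))
        if constraint is not None and not constraint(cfg):
            continue  # e.g., exceeds an assumed RAM/Flash/MAC budget
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

With a toy scorer, `time_bound_grid_search(lambda c: c["filters"] * c["layers"], {"filters": [4, 8, 16], "layers": [1, 2]}, constraint=lambda c: c["filters"] <= 8)` returns the largest configuration that still fits the constraint.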
Citations: 0
Assessing the Use of NVIDIA Multi-Instance GPU in the Automotive Domain
IF 2 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-04-14 | DOI: 10.1109/LES.2025.3560428
Javier Barrera;Leonidas Kosmidis;Jaume Abella;Francisco J. Cazorla
The search for resource isolation and segregation is relentless, as increased execution-time determinism (ETD) is an essential feature for a successful certification process in embedded critical domains such as automotive. In those domains, the advent of advanced software-controlled features requires unprecedented computing performance, which calls for acceleration hardware, with GPUs in a prominent position. The latest NVIDIA GPU generations (Ampere, Hopper, Blackwell) feature multi-instance GPU (MIG), a mechanism that allows partitioning the GPU into fully isolated GPU instances, each with its own memory, cache, and computing resources. Despite the clear benefits of MIG for ETD, the latest NVIDIA automotive GPUs do not implement it. In this work, we first empirically analyze the benefits of MIG on a nonautomotive GPU, showing the main traits of its use to improve ETD. Second, we identify the potential reasons precluding the deployment of MIG on automotive GPUs: automotive-market-specific needs, and the difference between the GPU and memory technologies used in high-performance GPUs, which implement MIG, and the automotive GPUs that lack it.
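The empirical side of such an analysis boils down to measuring execution-time jitter with isolation on and off. A generic profiling helper (our sketch, not the letter’s methodology) might look like:

```python
import statistics
import time

def etd_profile(workload, runs=50):
    """Run a workload repeatedly and summarize its execution-time
    determinism: mean, observed worst case, and max/min jitter ratio.
    Comparing the jitter with MIG enabled vs. disabled (or with and
    without a co-runner) quantifies the isolation benefit."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        workload()
        times.append(time.perf_counter() - t0)
    return {
        "mean_s": statistics.mean(times),
        "worst_s": max(times),
        "jitter": max(times) / min(times),  # 1.0 = perfectly deterministic
    }
```

A jitter ratio close to 1.0 indicates the determinism certification processes look for; contention on shared memory or cache typically inflates it.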
Citations: 0
SIMMAC: SRAM IMC-Based Multibit Multiplication With Analog Carry Computation
IF 2 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-04-09 | DOI: 10.1109/LES.2025.3559208
Chithambara Moorthii J.;Deepak Verma;Richa Mishra;Harshit Bansal;Sounak Dey;Arijit Mukherjee;Arpan Pal;Manan Suri
Several applications, ranging from artificial intelligence to encryption, require dense multibit matrix multiplications. With the advent of big-data applications and edge deployment, a recent paradigm shift focuses on energy-efficient computation methodologies such as in-memory computing (IMC). In this work, we propose SRAM IMC-based multibit multiplication with analog carry computation (SIMMAC), a novel 8T SRAM-based IMC accelerator for multibit multiplication with reconfigurable bit precision. To address the present-day challenges of IMC architectures, we propose a novel input and weight mapping strategy along with analog carry addition for in-memory computation. The proposed input and weight mapping strategy renders the implementation DAC-less, hence boosting the performance of the IMC macro in terms of area and power. The novel analog carry addition methodology computes the multibit product within the IMC macro, eliminating the need for peripheral digital shift-and-add circuits. With the proposed convolutional neural network (CNN) workload mapping analyzed in this study, our architecture executes matrix-vector multiplication (MVM) across all tiles in a single product cycle of 40 ns. Our architecture achieves 98% accuracy for MNIST classification, 819.2 GOPS, and 56.5 TOPS/W at a 200-MHz operating frequency at the TSMC 65-nm technology node.
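Functionally, the multibit product SIMMAC forms in analog is the same shift-and-add reduction of one-bit partial products that a digital periphery would otherwise perform. A bit-level reference model of that reduction:

```python
def multibit_multiply(a, b, bits=8):
    """Reference shift-and-add multiplication from 1-bit partial products,
    i.e., the reduction the IMC macro replaces with analog carry addition.
    (Bit-level model only; says nothing about the analog circuit.)"""
    acc = 0
    for i in range(bits):
        if (b >> i) & 1:      # 1-bit slice of the second operand
            acc += a << i     # shifted partial product
    return acc
```

For example, `multibit_multiply(13, 11)` accumulates the partial products 13, 26, and 104 to reproduce 13 × 11 = 143.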
Citations: 0
Optimizing Internal Communication of Compute-in-Memory-Based AI Accelerator
IF 2 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-04-09 | DOI: 10.1109/LES.2025.3559333
Letian Huang;Wenxu Cao;Linhan Sun;Zeyu Li;Ruitai Wang;Junyi Song;Shuyan Jiang
In compute-in-memory (CIM)-based systems, the weights of neural network models must be mapped to the memory array, and the mapping policy has a huge impact on system performance. In this letter, existing weight mapping policies are analyzed and categorized into two types: 1) position-wise and 2) channel-wise mapping. The channel-wise policy produces noncontiguous memory addresses at its output, while the position-wise policy suffers from its large amount of input data and the problem of data concatenation. A novel weight mapping policy, named mixed-dimension mapping, is then proposed to overcome the limitations of the existing policies. Experimental results show that it reduces the communication load of the system by 14%–32% and avoids data concatenation completely.
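The letter’s exact layouts are not reproduced here, but the two families it contrasts can be illustrated as different flattenings of a convolution weight tensor into crossbar rows; the axis orders below are assumptions for illustration:

```python
import numpy as np

def position_wise(w):
    """Flatten a (Cout, Cin, K, K) conv weight so that each kernel
    position (k1, k2) owns a contiguous block of crossbar rows."""
    cout, cin, k, _ = w.shape
    return w.transpose(2, 3, 1, 0).reshape(k * k * cin, cout)

def channel_wise(w):
    """Flatten the same tensor so that each input channel owns a
    contiguous block of crossbar rows."""
    cout, cin, k, _ = w.shape
    return w.transpose(1, 2, 3, 0).reshape(cin * k * k, cout)
```

Both produce a (Cin·K·K) × Cout matrix; only the row ordering differs, and it is that ordering which determines the input streaming pattern and whether output addresses stay contiguous.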
Citations: 0
Adaptive Behavior-Driven Thermal Management Framework in Heterogeneous Multicore Processors
IF 2 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-04-07 | DOI: 10.1109/LES.2025.3558314
Milanpreet Kaur;Karminder Singh;Suman Kumar
In modern computing environments, heterogeneous multicore processors are increasingly used to balance performance and energy efficiency. However, as processor architectures become more complex and workloads increase, traditional dynamic voltage and frequency scaling (DVFS) methods face challenges in ensuring thermal stability without significant performance tradeoffs in high-performance computing applications. This letter proposes a scalable adaptive thermal management framework that leverages phase-based thermal detection of workloads alongside adaptive migration techniques. The framework dynamically detects the thermal phases of running applications and optimally allocates tasks and threads across cores based on thermal characteristics and workload demands. The proposed framework on the Apalis iMX8, evaluated using the PARSEC benchmark, reduces average and peak temperatures by 16.7 °C and 32.5 °C, respectively, while enhancing performance by 13.5% compared to DVFS-based dynamic thermal management techniques. It also outperforms methods such as compiler-assisted reinforcement learning for thermal-aware task scheduling, DVFS, and PTS, demonstrating superior efficiency and adaptability in thermal management.
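A behavioral sketch of the migration decision — far simpler than the letter’s phase-based framework — shows the choice such a manager makes each epoch. The temperature threshold is an assumed value:

```python
def plan_migration(core_temps, threshold_c=75.0):
    """If the hottest core exceeds the threshold, propose migrating its
    load to the coolest core; return (src, dst) core indices or None.
    (Toy policy for illustration, not the letter's framework.)"""
    hot = max(range(len(core_temps)), key=core_temps.__getitem__)
    cool = min(range(len(core_temps)), key=core_temps.__getitem__)
    if core_temps[hot] > threshold_c and hot != cool:
        return hot, cool
    return None
```

With readings of 80 °C, 60 °C, and 70 °C, the sketch proposes moving load from core 0 to core 1; below the threshold it leaves the mapping alone, avoiding needless migration overhead.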
Citations: 0
From MLIR to Scheduled CDFG: A Design Flow for Hardware Resource Estimation
IF 2 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-03-29 | DOI: 10.1109/LES.2025.3575017
Joel A. Quevedo;Yazmin Maldonado
Efficient early-stage exploration in hardware design can significantly enhance design quality and reduce development time. This letter introduces a novel methodology that leverages the multilevel intermediate representation (MLIR) to extract control and data flow graphs (CDFGs) with early-stage resource estimates for area, delay, and power consumption. By employing Polygeist for C-to-MLIR conversion, coupled with Graphviz for visualization, we generate structured CDFGs and then apply three scheduling algorithms: 1) ASAP; 2) ALAP; and 3) random. Evaluated through two case studies, our approach produces valid scheduled graphs and demonstrates how both code structure and scheduling strategy critically impact hardware resource utilization and performance. This work sets the stage for resource-aware design-space exploration using MLIR, enabling designers to evaluate configurations and make informed tradeoffs prior to time-consuming synthesis processes.
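Of the three schedulers, ASAP is the simplest to state: every operation starts as soon as all of its CDFG predecessors have finished. A sketch over a toy dependence graph (the node names and delays are hypothetical, not from the letter’s case studies):

```python
def asap_schedule(preds, delay):
    """ASAP scheduling of a CDFG given predecessor lists and op delays:
    start(op) = max over predecessors p of (start(p) + delay(p))."""
    start = {}

    def visit(op):
        if op not in start:
            start[op] = max((visit(p) + delay[p] for p in preds[op]),
                            default=0)
        return start[op]

    for op in preds:
        visit(op)
    return start
```

For two one-cycle loads feeding a two-cycle multiply and then an add, the schedule is `{ld_a: 0, ld_b: 0, mul: 1, add: 3}`; ALAP would instead push each start time as late as the critical path allows.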
Citations: 0
Indoor Collaborative Robot Exploration: A Distributed Market-Based Approach
IF 2 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Hardware & Architecture) | Pub Date: 2025-03-27 | DOI: 10.1109/LES.2025.3555547
Ricardo Ercoli;Fausto Navadian;Joaquin Urrisa;Pablo Monzón;Facundo Benavides
Collaborative robotic exploration relies on multiple robots working together to survey an unknown environment. This letter presents the implementation of a collaborative fleet of robots designed to perform autonomous 2-D indoor mapping. The main contributions are: 1) an original solution to the problem of distributing multiple tasks among multiple robots, implemented as a distributed version of the auction mechanism and 2) the release of the code through a public repository.
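A market-based allocation of this kind can be sketched as a sequential greedy auction: in each round the unassigned task with the globally lowest bid is awarded to the bidding robot. This centralized toy ignores the letter’s distributed messaging, and the cost table used below is made up:

```python
def auction_allocate(robots, tasks, cost):
    """Award tasks one at a time: in each round, pick the (task, robot)
    pair with the lowest bid among the still-unassigned tasks.
    (Greedy centralized sketch of a market-based mechanism.)"""
    assignment = {}
    remaining = set(tasks)
    while remaining:
        task, robot = min(
            ((t, r) for t in remaining for r in robots),
            key=lambda pair: cost(pair[1], pair[0]),
        )
        assignment[task] = robot
        remaining.remove(task)
    return assignment
```

With two robots bidding travel costs on two frontier targets, each target goes to the robot that can reach it most cheaply; a distributed version would exchange these bids over the network instead of evaluating them in one place.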
Citations: 0