
Latest publications from IEEE Embedded Systems Letters

MdCSR: A Memory-Efficient Sparse Matrix Compression Format
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-17 | DOI: 10.1109/LES.2025.3598189
G. Noble;S. Nalesh;S. Kala;Salim Ullah;Akash Kumar
Efficient representation of sparse matrices is critical for reducing memory usage and improving performance in hardware-accelerated computing systems. This letter presents memory-efficient delta-compressed storage row (MdCSR), a novel sparse matrix format designed to improve both storage efficiency and execution speed. MdCSR replaces absolute column indices with compact relative offsets and selectively applies delta encoding, resulting in a more compact index structure. Compared to traditional formats, it achieves an average of 15.45% memory savings over compressed sparse row (CSR), 52.77% over dCSR, and around 20% reduction in execution time. A dedicated architecture for CSR to MdCSR compression is also presented, optimized for real-time and low-overhead FPGA deployment.
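The letter describes MdCSR only at a high level; as an illustration of the underlying idea of replacing absolute column indices with relative offsets, here is a minimal delta-encoding sketch over a standard CSR structure (the function name is hypothetical, and MdCSR's selective encoding and bit packing are not modeled):

```python
def csr_delta_encode(col_indices, row_ptr):
    """Replace absolute CSR column indices with per-row deltas.

    The first index in each row is kept as-is; subsequent indices are
    stored as offsets from their predecessor. For clustered nonzeros
    these offsets are small and fit in fewer bits than absolute indices.
    """
    deltas = []
    for r in range(len(row_ptr) - 1):
        prev = None
        for c in col_indices[row_ptr[r]:row_ptr[r + 1]]:
            deltas.append(c if prev is None else c - prev)
            prev = c
    return deltas

# 2x4 matrix with nonzeros at (0,0), (0,2), (1,1), (1,3)
cols = [0, 2, 1, 3]
rptr = [0, 2, 4]
print(csr_delta_encode(cols, rptr))  # [0, 2, 1, 2]
```

Decoding reverses the prefix sums per row, which is why such formats pair naturally with a streaming hardware decompressor.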
{"title":"MdCSR: A Memory-Efficient Sparse Matrix Compression Format","authors":"G. Noble;S. Nalesh;S. Kala;Salim Ullah;Akash Kumar","doi":"10.1109/LES.2025.3598189","DOIUrl":"https://doi.org/10.1109/LES.2025.3598189","url":null,"abstract":"Efficient representation of sparse matrices is critical for reducing memory usage and improving performance in hardware-accelerated computing systems. This letter presents memory-efficient delta-compressed storage row (MdCSR), a novel sparse matrix format designed to improve both storage efficiency and execution speed. MdCSR replaces absolute column indices with compact relative offsets and selectively applies delta encoding, resulting in a more compact index structure. Compared to traditional formats, it achieves an average of 15.45% memory savings over compressed sparse row (CSR), 52.77% over dCSR, and around 20% reduction in execution time. A dedicated architecture for CSR to MdCSR compression is also presented, optimized for real-time and low-overhead FPGA deployment.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"289-292"},"PeriodicalIF":2.0,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Container-Based Fail-Operational System Architecture for Software-Defined Vehicles
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-17 | DOI: 10.1109/LES.2025.3600581
Changjo Cho;Hamin An;Jangho Shin;Jong-Chan Kim
Future software-defined vehicles (SDVs) are expected to employ a zonal architecture with container-based microservices on distributed computing nodes. Such containers must be carefully orchestrated so that proper failover mechanisms ensure operational continuity. To that end, we first use K3s, a lightweight Kubernetes distribution, as our baseline. However, we found that K3s exhibits significant failover delays (over 5 min with default settings), making it unsuitable for safety-critical applications. We thus propose an enhanced health-check and failover mechanism based on minimized sensor-triggered heartbeat intervals and warm-standby container redundancy. Our experiments with realistic containerized applications (in-cabin pose estimation and on-road lane detection) show that failover delays are reduced to under 1 s, achieving the real-time performance required by safety-critical applications in future SDVs.
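The letter's mechanism is specific to K3s internals, but the arithmetic behind heartbeat-driven failover is generic. A sketch (all names hypothetical) of how the detection delay is the heartbeat interval times the number of tolerated misses, which is why shrinking the interval and pre-starting a warm standby brings recovery under a second:

```python
def failover_delay(beat_ok, interval_s, misses_allowed):
    """Given per-interval heartbeat outcomes (True = beat received),
    return the elapsed time at which failover triggers, or None if the
    primary never misses enough consecutive beats. With a warm standby
    already running, this detection delay dominates total recovery time.
    """
    missed = 0
    for i, ok in enumerate(beat_ok):
        missed = 0 if ok else missed + 1
        if missed >= misses_allowed:
            return (i + 1) * interval_s
    return None

# 100 ms sensor-triggered heartbeats, 3 consecutive misses tolerated:
print(failover_delay([True, False, False, False], 0.1, 3))  # 0.4
```

Contrast 0.4 s here with minutes for coarse default probe periods and long eviction timeouts.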
{"title":"Container-Based Fail-Operational System Architecture for Software-Defined Vehicles","authors":"Changjo Cho;Hamin An;Jangho Shin;Jong-Chan Kim","doi":"10.1109/LES.2025.3600581","DOIUrl":"https://doi.org/10.1109/LES.2025.3600581","url":null,"abstract":"Future software-defined vehicles (SDVs) are expected to employ the zonal architecture with container-based microservices on distributed computing nodes. Such containers should be carefully orchestrated to ensure operational continuity by proper failover mechanisms. For that, we first try to use K3s, a lightweight Kubernetes implementation, as our baseline. However, we found that K3s has significant failover delays (over 5 min by default settings), which makes it unsuitable for safety-critical applications. We thus propose an enhanced health check and failover mechanism by minimized sensor-triggered heartbeat intervals and warm standby container redundancy. Our experiments with realistic containerized applications (i.e., in-cabin pose estimation and on-road lane detection) show that the failover delays are reduced to under 1 s, achieving the real-time performance for safety-critical applications in future SDVs.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"293-296"},"PeriodicalIF":2.0,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Upcoming Era of Specialized Models
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-17 | DOI: 10.1109/LES.2025.3614406
Aviral Shrivastava
{"title":"The Upcoming Era of Specialized Models","authors":"Aviral Shrivastava","doi":"10.1109/LES.2025.3614406","DOIUrl":"https://doi.org/10.1109/LES.2025.3614406","url":null,"abstract":"","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"288-288"},"PeriodicalIF":2.0,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11206580","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Detecting Nonequivalence in Neural Networks Through In-Distribution Counterexample Generation
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-17 | DOI: 10.1109/LES.2025.3600585
Dina A. Moussa;Michael Hefenbrock;Mehdi Tahoori
Neural networks (NNs) have achieved profound results in various safety-critical applications such as healthcare, medical devices, and automotive systems. These NN models are usually trained on cloud systems; however, due to latency, privacy, and bandwidth concerns, inference is performed on edge devices. Consequently, the model size is often reduced through pruning and quantization to map cloud-trained models onto edge artificial-intelligence hardware. To ensure that the reduced models maintain the integrity of the original, larger models, detecting inequivalences is crucial. In this letter, we focus on inequivalence detection by identifying cases where the behavior of the reduced model diverges from that of the original model. This is achieved by formulating an optimization problem that maximizes the difference between the two models. In contrast to related work, our proposed approach is agnostic to the choice of activation function and can be applied to networks using a wide variety of nonlinearities. Furthermore, it considers only counterexamples that lie in the range of the original data (so-called in-distribution inputs), as only in these regions can the model be considered properly specified. The experimental results showed that the found counterexamples were able to differentiate models across various NN architectures and datasets.
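The abstract gives only the shape of the optimization; a toy sketch of the idea follows, maximizing the squared output gap between an "original" and a "reduced" model by finite-difference ascent, with a box projection standing in (crudely) for the in-distribution constraint. All names, the ascent scheme, and the toy models are illustrative assumptions, not the letter's method:

```python
import numpy as np

def find_counterexample(f, g, x0, lo, hi, lr=0.5, steps=300, eps=1e-4):
    """Maximize d(x) = (f(x) - g(x))**2 by central-difference gradient
    ascent, projecting x back onto the box [lo, hi] after every step
    (a crude stand-in for an in-distribution constraint)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            grad[i] = ((f(x + e) - g(x + e)) ** 2
                       - (f(x - e) - g(x - e)) ** 2) / (2 * eps)
        x = np.clip(x + lr * grad, lo, hi)  # stay inside the data range
    return x

# Toy "original" vs. "reduced" model: they agree near 0, diverge at the edges.
f = lambda x: float(np.tanh(x).sum())
g = lambda x: float(x.sum())            # reduced model: tanh replaced by identity
x = find_counterexample(f, g, np.full(2, 0.5), -1.0, 1.0)
print(x, abs(f(x) - g(x)))              # gap grows toward the box boundary
```

A real instance would replace the toy scalar models with the full and compressed networks and the box with a data-driven region.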
{"title":"Detecting Nonequivalence in Neural Networks Through In-Distribution Counterexample Generation","authors":"Dina A. Moussa;Michael Hefenbrock;Mehdi Tahoori","doi":"10.1109/LES.2025.3600585","DOIUrl":"https://doi.org/10.1109/LES.2025.3600585","url":null,"abstract":"neural networks (NNs) have made profound achievements in various safety-critical applications such as healthcare, medical devices, and automotive. These NN models are usually trained using cloud systems; however, due to latency, privacy, and bandwidth concerns, inference is performed on edge devices. Consequently, the model size is often reduced through pruning and quantization to map the cloud-trained models to edge artificial intelligence hardware. To ensure that the reduced models maintain the integrity of the original, larger models, detecting inequivalences is crucial. In this letter, we focus on inequivalence detection by identifying cases where the behavior of the reduced model diverges from the original model. This is achieved by formulating an optimization problem to maximize the difference between the two models. In contrast to the related work, our proposed approach is agnostic to the choice of activation function and can be applied to networks utilizing a wide variety of nonlinearities. Furthermore, it considers only counterexamples that are in range of the original data, the so-called In Distribution, as only in these regions, the model can be considered properly specified. 
The experimental results showed that the found counterexamples were able to differentiate models for various NN architectures and datasets.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"297-300"},"PeriodicalIF":2.0,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A 340-μW TinyML Using LUT-Based Reservoir Computing on Low-Cost FPGAs
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3598209
Kanta Yoshioka;Hakaru Tamukoh
We propose a reservoir computing (RC) system that operates under an extremely strict power budget of less than 340 μW, enabling continuous operation for one year on an LR6 battery. By combining a look-up-table-network-based RC (LUTNet-RC) with iCE40-series field-programmable gate arrays (FPGAs), the proposed system achieves high computational accuracy while significantly reducing power consumption compared with conventional TinyML devices. The proposed method achieves 93.1%, 98.6%, and 92.7% accuracy in real TinyML applications (human activity recognition, epilepsy detection, and electrocardiogram signal analysis, respectively), while operating at about 254 to 335 μW. This work shows that the proposed LUTNet-RC on iCE40 FPGAs is a promising solution for long-term machine learning applications in battery-powered edge devices.
{"title":"A 340- μ W TinyML Using LUT-Based Reservoir Computing on Low-Cost FPGAs","authors":"Kanta Yoshioka;Hakaru Tamukoh","doi":"10.1109/LES.2025.3598209","DOIUrl":"https://doi.org/10.1109/LES.2025.3598209","url":null,"abstract":"We propose a reservoir computing (RC) system that operates under extremely strict power consumption constraints of less than <inline-formula> <tex-math>$340~{mu }$ </tex-math></inline-formula>W, which enables continuous operation for one year on a LR6 battery. By combining a look-up tables networks based RC (LUTNet-RC) with iCE40 series field-programmable gate arrays (FPGAs), the proposed system achieves high computational accuracy while significantly reducing power consumption compared with conventional TinyML devices. The proposed method achieves 93.1%, 98.6%, and 92.7% accuracy in real TinyML applications such as human activity recognition, epilepsy detection, and electrocardiogram signal analysis, respectively, while operating at about 254 to <inline-formula> <tex-math>$335~{mu }$ </tex-math></inline-formula>W power consumption. This work shows that the proposed LUTNet-RC on iCE40 FPGAs are promising solutions for long-term operational machine learning application implementations in battery-powered edge devices.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"357-360"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11205905","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Toward Efficient FPGA Accelerator DSE via Hierarchical and RM-Guided Methods
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600555
Chao Shi;Qianyu Cheng;Teng Wang;Chao Wang;Xuehai Zhou
Field-programmable gate array (FPGA) accelerator design has gradually become a mainstream acceleration solution, widely applied in fields such as large language models, deep learning inference, autonomous driving, real-time 3-D scene reconstruction, and embedded intelligent terminals. High-level synthesis (HLS) technology provides significant support for FPGA accelerator design, greatly improving design efficiency and flexibility. However, manual parameter tuning by designers is still required to achieve optimal performance. Existing research has proposed automated design space exploration (DSE) methods to assist with parameter tuning, but these methods often exhibit low efficiency on complex HLS designs and, in some cases, fail to function at all. To address this, we present an efficient DSE method guided by hierarchical analysis and rule mining (RM), aimed at tackling more complex design challenges. This approach performs hierarchical analysis of design solutions and integrates RM techniques to optimize the design-space search, enabling efficient exploration of superior design solutions. Experimental results show that our method achieves performance comparable to state-of-the-art (SOTA) techniques while delivering a speed-up of 3.6× to 30.4×. Moreover, it enables effective exploration of complex design spaces that existing methods struggle to handle.
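The letter's hierarchical analysis and mined rules are not detailed in the abstract; the flavor of rule-guided pruning can still be sketched with a deliberately coarse one-parameter exclusion rule. Everything here, the parameter names, the cost model, and the rule form, is a hypothetical illustration, not the proposed method:

```python
from itertools import product

def dse_with_rules(params, evaluate, budget):
    """Exhaustive-order DSE sketch with mined exclusion rules: a
    parameter value seen in any failed build is (coarsely) assumed bad
    and configurations containing it are skipped without evaluation.
    evaluate() returns a latency, or None if the design fails to build."""
    names = list(params)
    space = list(product(*params.values()))
    failed_values = {n: set() for n in names}
    best, tried = None, 0
    for cfg in space:
        if tried >= budget:
            break
        # rule check: prune configs containing a value that has failed before
        if any(v in failed_values[n] for n, v in zip(names, cfg)):
            continue
        tried += 1
        lat = evaluate(dict(zip(names, cfg)))
        if lat is None:
            for n, v in zip(names, cfg):
                failed_values[n].add(v)   # coarse rule; real rule mining is subtler
        elif best is None or lat < best[1]:
            best = (cfg, lat)
    return best

# Toy space: unroll factor x buffer depth; unroll 8 never fits on the device.
def evaluate(c):
    if c["unroll"] == 8:
        return None
    return 100 / c["unroll"] + c["depth"]

best = dse_with_rules({"unroll": [1, 2, 4, 8], "depth": [2, 4]}, evaluate, 16)
print(best)
```

The pruning step is where mined rules pay off: each skipped configuration saves an HLS synthesis run, which is the expensive operation in practice.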
{"title":"Toward Efficient FPGA Accelerator DSE via Hierarchical and RM-Guided Methods","authors":"Chao Shi;Qianyu Cheng;Teng Wang;Chao Wang;Xuehai Zhou","doi":"10.1109/LES.2025.3600555","DOIUrl":"https://doi.org/10.1109/LES.2025.3600555","url":null,"abstract":"field-programmable gate array (FPGA) accelerator design has gradually become a mainstream acceleration solution, widely applied in fields, such as large language models, deep learning inference, autonomous driving, real-time 3-D scene reconstruction, and embedded intelligent terminals. high-level synthesis (HLS) technology has provided significant support for FPGA accelerator design, greatly improving design efficiency and flexibility. However, manual parameter tuning by designers is still required to achieve optimal performance. Existing research has proposed automated design space exploration (DSE) methods to assist in parameter tuning, but these methods often exhibit low efficiency when dealing with complex HLS designs and, in some cases, fail to function properly. To address this, we present an efficient DSE method guided by hierarchical analysis and rule mining (RM), aimed at tackling more complex design challenges. This approach performs hierarchical analysis of design solutions and integrates RM techniques to optimize the design space search process, enabling efficient exploration of superior design solutions. Experimental results show that our method achieves performance comparable to state-of-the-art (SOTA) techniques, while delivering a speed-up of <inline-formula> <tex-math>$3.6{times }$ </tex-math></inline-formula> to <inline-formula> <tex-math>$30.4{times }$ </tex-math></inline-formula>. 
Moreover, it enables the effective exploration of complex design spaces that existing methods struggle to handle.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"361-364"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
WCET-Aware Partitioning and Allocation of Disaggregated Networks for Multicore Systems
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600584
Junjie Shi;Christian Hakert;Kay Heider;Mario Günzel;Nils Hölscher;Daniel Kuhse;Jian-Jia Chen;Logan Kenwright;Sobhan Chatterjee;Nathan Allen;Partha Roop
The integration of machine learning into safety-critical cyber-physical systems has significantly increased computational demands, which are often met by modern multicore platforms. While complex memory subsystems, including local caches, make it challenging to maintain timing predictability, they also provide opportunities for worst-case execution time (WCET) optimization through improved data locality. To address this, we propose a multicore partitioning and allocation strategy that leverages sparse structures through neural network disaggregation to optimize the WCET. Our evaluation shows that disaggregated neural networks achieve a significantly reduced WCET, compared to fully connected monolithic neural networks of similar size.
{"title":"WCET-Aware Partitioning and Allocation of Disaggregated Networks for Multicore Systems","authors":"Junjie Shi;Christian Hakert;Kay Heider;Mario Günzel;Nils Hölscher;Daniel Kuhse;Jian-Jia Chen;Logan Kenwright;Sobhan Chatterjee;Nathan Allen;Partha Roop","doi":"10.1109/LES.2025.3600584","DOIUrl":"https://doi.org/10.1109/LES.2025.3600584","url":null,"abstract":"The integration of machine learning into safety-critical cyber-physical systems has significantly increased computational demands, which are often met by modern multicore platforms. While complex memory subsystems, including local caches, make it challenging to maintain timing predictability, they also provide opportunities for worst-case execution time (WCET) optimization through improved data locality. To address this, we propose a multicore partitioning and allocation strategy that leverages sparse structures through neural network disaggregation to optimize the WCET. Our evaluation shows that disaggregated neural networks achieve a significantly reduced WCET, compared to fully connected monolithic neural networks of similar size.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"309-312"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11205907","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Beyond BNNs: Design and Acceleration of Sub-Bit Neural Networks Using RISC-V Custom Functional Units
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600565
Muhammad Sabih;Mohamed Abdo;Frank Hannig;Jürgen Teich
Binary neural networks (BNNs) are known for their minimal memory requirements, making them an attractive choice for resource-constrained environments. Sub-bit neural networks (SBNNs) are a more recent advancement that extends the benefits of BNNs by compressing them even further, achieving sub-bit-level representations to maximize efficiency. However, effectively compressing and accelerating BNNs presents challenges. In this letter, we propose a novel approach to compress BNNs using a fixed-length compression scheme that can be efficiently decoded at runtime. We then propose RISC-V extensions, implemented as a custom function unit (CFU), that decode compressed weights via a codebook stored in FPGA on-board memory, followed by XOR and population-count operations. This approach achieves a speedup of up to 2× over conventional BNNs deployed on the RISC-V softcore, with significantly less accuracy degradation, and provides a foundation for exploring even higher compression configurations to further improve performance.
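The two kernels the CFU combines, constant-time codebook decode of fixed-length codes and the XOR/popcount inner product of binary networks, can be sketched in software. The codebook contents and function names below are hypothetical; only the XOR/popcount identity is standard BNN arithmetic:

```python
def binary_dot(w_bits, a_bits, n):
    """Bipolar {-1, +1} dot product of two n-bit vectors via XOR and
    popcount: mismatches = popcount(w XOR a), dot = n - 2 * mismatches."""
    mism = bin((w_bits ^ a_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * mism

# Hypothetical 2-bit codes mapped to 8-bit weight patterns. Fixed-length
# codes keep decoding a single constant-time table lookup per word.
codebook = {0b00: 0b00000000, 0b01: 0b11111111,
            0b10: 0b10101010, 0b11: 0b01010101}

def decode_weights(codes):
    """Expand a stream of fixed-length codes into 8-bit weight words."""
    return [codebook[c] for c in codes]

acts = 0b11001100
w = decode_weights([0b10])[0]       # -> 0b10101010
print(binary_dot(w, acts, 8))       # 0
```

In the hardware version both steps collapse into one custom instruction, which is where the reported speedup comes from.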
{"title":"Beyond BNNs: Design and Acceleration of Sub-Bit Neural Networks Using RISC-V Custom Functional Units","authors":"Muhammad Sabih;Mohamed Abdo;Frank Hannig;Jürgen Teich","doi":"10.1109/LES.2025.3600565","DOIUrl":"https://doi.org/10.1109/LES.2025.3600565","url":null,"abstract":"Binary neural networks (BNNs) are known for their minimal memory requirements, making them an attractive choice for resource-constrained environments. SBNN-nps are a more recent advancement that extend the benefits of BNNs by compressing them even further, achieving sub-bit level representations to maximize efficiency. However, effectively compressing and accelerating BNNs presents challenges. In this letter, we propose a novel approach to compress BNNs using a fixed-length compression scheme that can be efficiently decoded at runtime. We then propose RISC-V extensions, implemented as a custom function unit (CFU), to decode compressed weights via a codebook stored on an FPGA on-board memory, followed by XOR and population count operations. This approach achieves a speedup of up to 2× compared to conventional BNNs deployed on the RISC-V softcore, with Significantly less accuracy degradation, and provides a foundation for exploring even higher compression configurations to improve performance further.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"329-332"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Formal Modeling and Verification of Generic Credential Management Processes for Industrial Cyber-Physical Systems
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3598202
Julian Göppert;Axel Sikora
Industrial cyber-physical systems (ICPS) face rising numbers of cyberattacks, requiring secure credential management even in resource-constrained embedded systems. Standards specifying field-level communication for ICPS (e.g., PROFINET or OPC UA) define protocol-specific credential management processes yet lack formal security verification. We propose a generic model capturing initial security onboarding and automated credential provisioning. Using ProVerif, an automatic symbolic protocol verifier, we formalize certificate-based authentication under a Dolev-Yao adversary, verifying private-key secrecy, component authentication, and mutual authentication with the operator domain. Robustness checks confirm resilience against key leakage and highlight the vulnerabilities of the trust-on-first-use concept proposed by the standards. Our model offers the first formal guarantees for secure credential management in ICPS.
{"title":"Formal Modeling and Verification of Generic Credential Management Processes for Industrial Cyber–Physical Systems","authors":"Julian Göppert;Axel Sikora","doi":"10.1109/LES.2025.3598202","DOIUrl":"https://doi.org/10.1109/LES.2025.3598202","url":null,"abstract":"Industrial cyber-physical systems (ICPS) face rising cyberattacks, requiring secure credential management also in resource-constrained embedded systems. Standards specifying field level communication of ICPS (e.g., PROFINET or OPC UA) define protocol-specific credential management processes, yet lack formal security verification. We propose a generic model capturing initial security onboarding and automated credential provisioning. Using ProVerif, an automatic symbolic protocol verifier, we formalize certificate-based authentication under a Dolev-Yao adversary, verifying private key secrecy, component authentication, and mutual authentication with the operator domain. Robustness checks confirm resilience against key leakage and highlight the vulnerabilities of the trust on first use concept proposed by the standards. Our model offers the first formal guarantees for secure credential management in ICPS.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"349-352"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Instruction-Level Support for Deterministic Dataflow in Real-Time Systems
IF 2.0 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600618
Bo Zhang;Yinkang Gao;Caixu Zhao;Xi Li
Ensuring predictable and repeatable behavior in concurrent real-time systems requires dataflow determinism—that is, each consumer task instance must always read data from the same producer instance. While the logical execution time (LET) model enforces this property, its software implementations typically rely on timed I/O or multibuffering protocols. These approaches introduce software complexity, execution overhead, and priority inversion, resulting in increased and unstable task response times, thereby degrading overall schedulability. We propose time-semantic memory instruction (TSMI), a new instruction set extension that embeds logical timing into memory access operations. Unlike existing LET implementations, TSMI enforces dataflow determinism at the instruction level, eliminating the need for memory protocols or access ordering constraints. We develop a TSMI microarchitectural implementation that translates TSMI instructions into standard memory accesses and a programming model that not only captures LET semantics but also enables more expressive, per-access dataflow control. A cycle-accurate RISC-V simulator with TSMI achieves up to 95.36% worst-case response time (WCRT) and 98.88% response time variability (RTV) reduction compared to existing methods.
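The multibuffering protocols that TSMI is designed to replace implement LET semantics in software: a producer's writes become visible only at a logical publish instant, so every consumer in the next period reads the same producer instance. A minimal double-buffer sketch of that publish-at-deadline behavior (class and method names are hypothetical):

```python
class LetChannel:
    """Double-buffered channel approximating LET dataflow semantics:
    writes land in a shadow buffer and become visible to readers only
    when the runtime publishes at the logical deadline. All consumers
    between publishes therefore observe the same producer instance."""

    def __init__(self, initial):
        self.visible = initial
        self._shadow = initial

    def write(self, value):
        """Producer: may run anywhere within its period."""
        self._shadow = value

    def publish(self):
        """Runtime: called exactly at the logical execution-time boundary."""
        self.visible = self._shadow

    def read(self):
        """Consumer: always sees the last published value."""
        return self.visible

ch = LetChannel(0)
ch.write(42)
print(ch.read())   # 0  -> the mid-period write is not yet visible
ch.publish()
print(ch.read())   # 42 -> visible only from the logical publish instant
```

TSMI's point is to push exactly this visibility rule into the memory instructions themselves, removing the software protocol (and its priority-inversion window) from the critical path.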
{"title":"Instruction-Level Support for Deterministic Dataflow in Real-Time Systems","authors":"Bo Zhang;Yinkang Gao;Caixu Zhao;Xi Li","doi":"10.1109/LES.2025.3600618","DOIUrl":"https://doi.org/10.1109/LES.2025.3600618","url":null,"abstract":"Ensuring predictable and repeatable behavior in concurrent real-time systems requires dataflow determinism—that is, each consumer task instance must always read data from the same producer instance. While the logical execution time (LET) model enforces this property, its software implementations typically rely on timed I/O or multibuffering protocols. These approaches introduce software complexity, execution overhead, and priority inversion, resulting in increased and unstable task response times, thereby degrading overall schedulability. We propose time-semantic memory instruction (TSMI), a new instruction set extension that embeds logical timing into memory access operations. Unlike existing LET implementations, TSMI enforces dataflow determinism at the instruction level, eliminating the need for memory protocols or access ordering constraints. We develop a TSMI microarchitectural implementation that translates TSMI instructions into standard memory accesses and a programming model that not only captures LET semantics but also enables more expressive, per-access dataflow control. 
A cycle-accurate RISC-V simulator with TSMI achieves up to 95.36% worst-case response time (WCRT) and 98.88% response time variability (RTV) reduction compared to existing methods.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"341-344"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0