Embedded Hardware-Efficient FPGA Architecture for SVM Learning and Inference

IF 3.6 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Information Systems) · IEEE Access · Pub Date: 2025-04-18 · DOI: 10.1109/ACCESS.2025.3562453
B. B. Shabarinath;Muralidhar Pullakandam
IEEE Access, vol. 13, pp. 68930-68947. Article: https://ieeexplore.ieee.org/document/10969767/ (PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10969767)
Citations: 0

Abstract

Edge computing enables AI processing on resource-constrained devices, but high computational cost and tight energy budgets still make on-device machine learning inefficient, especially for Support Vector Machine (SVM) classifiers. Although SVM classifiers are generally very accurate, they require solving a quadratic optimization problem, which makes real-time embedded implementation challenging. While Sequential Minimal Optimization (SMO) has improved the efficiency of SVM training, traditional implementations still suffer from high computational cost. In this paper, we propose Parallel SMO, a new algorithm that selects multiple violating pairs in each iteration, allowing batch-wise updates that accelerate convergence and exploit parallel computation. By buffering kernel values, it minimizes redundant computation, improving memory efficiency and speeding up SVM training on FPGA architectures. In addition, we present an embedded hardware-efficient FPGA architecture that integrates Parallel SMO-based SVM learning with SVM inference. An SVM controller schedules the operations of each clock cycle so that computation and memory access happen concurrently. Dynamic pipeline scheduling employs parameterized modules to schedule linear or nonlinear kernels and produces dimension-based reconfigurable blocks. A configuration signal turns on the required sub-blocks and clock-gates unused ones, improving resource utilization, energy efficiency, and overall performance. Across several benchmark datasets, the scheme consistently reduces clock cycles per iteration and improves throughput (up to 2427 iterations per second). It achieves up to 98% classification accuracy at low power, with a training power of 47 mW and high energy efficiency (up to 51.64×10³ iterations per joule).
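The reported figures are mutually consistent if energy efficiency is taken as throughput divided by training power; this derivation is ours, not stated explicitly in the abstract:

```python
# Consistency check of the reported figures (assumption: energy
# efficiency = throughput / training power).
throughput = 2427            # iterations per second (reported)
power_w = 47e-3              # 47 mW training power (reported)
iters_per_joule = throughput / power_w
print(f"{iters_per_joule:.0f} iterations/J")   # ~51638, i.e. ~51.64e3
```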
Together, the adaptive kernel datapath, parallel error-update execution, and best-pair selection enable faster convergence, higher throughput, and resource-efficient on-chip inference.
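The core algorithmic idea — picking several KKT-violating pairs per sweep, updating them batch-wise with the standard analytic two-variable SMO solve, and reusing precomputed kernel values — can be sketched in software. This is a minimal illustrative sketch, not the paper's implementation; the function and parameter names are ours, and the pair-selection heuristics (most-violating-first ordering, largest-error-gap partner) are common SMO choices assumed here:

```python
import numpy as np

def parallel_smo(X, y, C=1.0, tol=1e-3, max_iter=200, pairs_per_iter=4):
    """Batch-wise SMO sketch (illustrative): each sweep selects several
    KKT-violating samples, pairs each with the sample of largest error gap,
    and applies the analytic two-variable update. The kernel matrix is
    precomputed once, mimicking the paper's kernel-value buffering."""
    n = len(y)
    K = X @ X.T                          # cached linear-kernel values
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_iter):
        f = (alpha * y) @ K + b          # decision values for all samples
        E = f - y                        # prediction errors
        kkt = y * E
        viol = np.where(((kkt < -tol) & (alpha < C)) |
                        ((kkt > tol) & (alpha > 0)))[0]
        if viol.size == 0:
            break                        # KKT conditions satisfied
        order = viol[np.argsort(-np.abs(kkt[viol]))]   # most violating first
        used = set()
        for i in order[:pairs_per_iter]:
            if i in used:
                continue
            j = int(np.argmax(np.abs(E - E[i])))       # largest error gap
            if j == i or j in used:
                continue
            eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
            if eta <= 1e-12:
                continue
            if y[i] != y[j]:
                L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
            else:
                L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
            if L >= H:
                continue
            aj = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
            alpha[i] += y[i] * y[j] * (alpha[j] - aj)   # uses old alpha[j]
            alpha[j] = aj
            used.update((int(i), j))
        sv = np.where((alpha > 1e-9) & (alpha < C - 1e-9))[0]
        if sv.size:                      # refresh bias from free support vectors
            b = float(np.mean(y[sv] - (alpha * y) @ K[:, sv]))
    return alpha, b

# Toy linearly separable data
X = np.array([[2., 2.], [3., 2.], [2., 3.], [-2., -2.], [-3., -2.], [-2., -3.]])
y = np.array([1., 1., 1., -1., -1., -1.])
alpha, b = parallel_smo(X, y)
pred = np.sign((alpha * y) @ (X @ X.T) + b)
print(pred)    # matches y
```

In the paper's hardware setting, the per-pair updates inside one sweep run in parallel datapath lanes rather than a Python loop, and the cached kernel row lookups correspond to on-chip buffer reads.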
Source Journal

IEEE Access (Computer Science, Information Systems; Engineering, Electrical & Electronic)
CiteScore: 9.80
Self-citation rate: 7.70%
Annual publications: 6673
Review time: 6 weeks
Journal description: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access publishes articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary": reviewers either Accept or Reject an article in the form it is submitted, in order to achieve rapid turnaround. Especially encouraged are submissions on: multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals; practical articles discussing new experiments or measurement techniques, or interesting solutions to engineering problems; development of new or improved fabrication or manufacturing techniques; and reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.
Latest articles from this journal

Named Entity Recognition With Clue-Word Tags From Patent Documents in Materials Science
Development of a Neural Network-Based Model to Generate an Absolute Luminance Map of an Interior Using a Camera Raw Image File
Reinforcement Learning-Based Fuzzer for 5G RRC Security Evaluation
Cite and Seek: Automated Literary Reference Mining at Corpus Scale
RSMA-Enabled RIS-Assisted Integrated Sensing and Communication for 6G: A Comprehensive Survey