
Latest Publications: IEEE Transactions on Very Large Scale Integration (VLSI) Systems

A Hybrid Domain and Pipelined Analog Computing Chain for MVM Computation
IF 2.8 | CAS Tier 2 (Engineering & Technology) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-26 | DOI: 10.1109/TVLSI.2024.3439355
Tianzhu Xiong;Yuyang Ye;Xin Si;Jun Yang
In this article, a stream-architecture, pipelined hybrid computing chain is presented to process matrix-vector multiplication (MVM). In each stage of the computing chain, a primary multiply-accumulate (MAC) stage consisting of charge-, time-, and digital-domain processing units performs signed or unsigned $8\times 1\times 8$-bit MAC operations and MSB quantization. Based on the stream architecture, the length of the computing chain can be configured to fit different MVM applications. In the charge-domain MAC unit, a double-plate sampling and weighted capacitor array is implemented with a write-yield- and efficiency-enhanced 7T bitcell and a three-step weighting scheme. To exploit the speed and resolution advantages of time-domain computing, a high-linearity voltage-to-time converter (VTC) followed by a dynamic tristate delay chain is proposed to transfer MAC values from the charge domain and store them in the time domain. To realize fast analog readout, a folding-type, distributed time-to-digital converter (TDC) is proposed. To fully eliminate the offset and variation in the distributed TDC, a specific residue readout timing and back-end calibration scheme are applied. In the digital domain, a double-input, double-clock dynamic D flip-flop is built to realize partial-sum transmission and accumulation in a single cycle with low energy and area consumption. Post-simulation results show that this computing chain can achieve 20.89–40.72-TOPS/W energy efficiency and 4.498-TOPS/mm² throughput.
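The per-stage operation, eight signed or unsigned 8-bit products accumulated and then MSB-quantized, can be modeled behaviorally. A minimal sketch, assuming a simple keep-the-top-bits quantizer; the function name, bit budgets, and quantizer are illustrative, not the chip's mixed-signal implementation:

```python
def mac8x1x8(weights, activations, signed=True, msb_bits=4):
    """Behavioral model of one 8x1x8-bit MAC stage with MSB quantization.

    weights, activations: eight 8-bit integers (signed or unsigned).
    Returns (full_sum, msb_quantized), where msb_quantized keeps only
    the top `msb_bits` bits of the accumulated result.
    """
    lo, hi = (-128, 127) if signed else (0, 255)
    assert len(weights) == len(activations) == 8
    assert all(lo <= v <= hi for v in list(weights) + list(activations))

    full_sum = sum(w * a for w, a in zip(weights, activations))
    # An 8-way sum of 8-bit products fits in 18 bits signed (|sum| <= 2^17)
    # or 20 bits unsigned, so quantize relative to that full-scale width.
    total_bits = 18 if signed else 20
    scale = 1 << (total_bits - msb_bits)
    return full_sum, full_sum // scale
```

In the actual chain, the full sum would stay in the analog (charge/time) domain while only the MSB-quantized part is digitized early; this model just makes the arithmetic concrete.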
Vol. 33, no. 1, pp. 52–65.
Citations: 0
A Fast-Convergence Near-Memory-Computing Accelerator for Solving Partial Differential Equations
IF 2.8 | CAS Tier 2 (Engineering & Technology) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-26 | DOI: 10.1109/TVLSI.2024.3458801
Chenjia Xie;Zhuang Shao;Ning Zhao;Xingyuan Hu;Yuan Du;Li Du
Solving partial differential equations (PDEs) is omnipresent in scientific research and engineering and requires numerical iteration that is expensive in both memory and computation. The primary concerns for solving PDEs are convergence speed, data movement, and power consumption. This work proposes the first fast-convergence PDE solver with an automatic-adjustment multiple-stride iteration method, significantly increasing the PDE convergence speed. A dynamic-precision near-memory-computing architecture with Booth encoding is proposed to reduce iterated intermediate data movement. A customized 32T compressor and a 14T full adder are designed to reduce the power and hardware cost of the solver. The processor is fabricated in 65-nm CMOS technology and occupies a 6.25-mm² die area. It achieves a $4\times$ convergence speedup compared with existing work.
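The paper's automatic multiple-stride scheme accelerates exactly this kind of fixed-point sweep. As background only (a baseline, not the proposed method), a plain Jacobi relaxation for the 2-D Laplace equation shows the iteration whose convergence such a solver speeds up:

```python
import numpy as np

def jacobi_laplace(u, n_iters=500):
    """Plain Jacobi relaxation for the 2-D Laplace equation.

    u: 2-D array whose boundary rows/columns hold fixed boundary values.
    Each sweep replaces every interior point with the average of its four
    neighbors; this baseline is what a fast-convergence solver (e.g., one
    with larger effective iteration strides) improves upon.
    """
    u = u.astype(float).copy()
    for _ in range(n_iters):
        u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                u[1:-1, :-2] + u[1:-1, 2:])
    return u
```

Each sweep touches every grid point, so reducing the iteration count (and keeping intermediates near memory) directly cuts both data movement and energy.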
Vol. 33, no. 2, pp. 578–582.
Citations: 0
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Publication Information
IF 2.8 | CAS Tier 2 (Engineering & Technology) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-26 | DOI: 10.1109/TVLSI.2024.3457191
Vol. 32, no. 10, p. C2.
Citations: 0
Hardware–Algorithm Codesigned Low-Latency and Resource-Efficient OMP Accelerator for DOA Estimation on FPGA
IF 2.8 | CAS Tier 2 (Engineering & Technology) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-26 | DOI: 10.1109/TVLSI.2024.3462467
Ruichang Jiang;Wenbin Ye
This article introduces an algorithm-hardware codesign optimized for low-latency and resource-efficient direction-of-arrival (DOA) estimation, employing a refined orthogonal matching pursuit (OMP) algorithm adept at handling the complexities of multisource detection, particularly in scenarios with closely spaced signal sources. At the algorithmic level, the approach incorporates a secondary correction mechanism (SCM) into the traditional OMP algorithm, significantly improving estimation accuracy and robustness. On the hardware front, a bespoke OMP accelerator has been developed, featuring a reconfigurable generic processing element (PE) array that supports various computational modes and leverages a multilevel spectral peak search strategy and pipelining techniques to enhance computational efficiency. Experimental evaluations reveal that the proposed system achieves a root-mean-square error (RMSE) for DOA estimation of less than 0.3° in multisource conditions at a signal-to-noise ratio (SNR) of 20 dB. In addition, the deployment of the OMP accelerator on a Zynq XC7Z020 development board uses modest logic resources: 5.49k LUTs, 3.28k FFs, 11.5 BRAMs, and 32 DSPs. Furthermore, the design achieves a computational latency of $2.83~\mu\text{s}$ for single-source estimation with eight antennas. After normalization, this reflects a reduction of approximately 17.8% in LUTs, 56.3% in FFs, and 5.7% in DSPs compared to current leading-edge technologies, all while maintaining competitive estimation accuracy and favorable estimation rates.
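For reference, baseline OMP itself is compact. The sketch below (plain NumPy, without the paper's secondary correction mechanism or spectral peak search) shows the greedy select-then-refit loop that such an accelerator implements in hardware:

```python
import numpy as np

def omp(A, y, k):
    """Baseline orthogonal matching pursuit: recover a k-sparse x with y ≈ A @ x.

    Each iteration picks the dictionary column most correlated with the
    current residual, then re-fits ALL selected coefficients by least
    squares (the "orthogonal" step). The accelerator in the article adds
    a secondary correction step on top of this greedy loop.
    """
    residual = y.astype(float).copy()
    support = []
    x = np.zeros(A.shape[1])
    for _ in range(k):
        correlations = np.abs(A.T @ residual)
        correlations[support] = 0.0        # never reselect an atom
        support.append(int(np.argmax(correlations)))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x
```

In DOA estimation, A would be an array-manifold dictionary over candidate angles, and the selected support indices map directly to estimated arrival angles.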
Vol. 33, no. 2, pp. 421–434.
Citations: 0
MBSNTT: A Highly Parallel Digital In-Memory Bit-Serial Number Theoretic Transform Accelerator
IF 2.8 | CAS Tier 2 (Engineering & Technology) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-26 | DOI: 10.1109/TVLSI.2024.3462955
Akhil Pakala;Zhiyu Chen;Kaiyuan Yang
Conventional cryptographic systems protect data during communication but give third-party cloud operators complete access to the decrypted user data they compute on. Homomorphic encryption (HE) promises to rectify this by allowing computations on encrypted data without actually decrypting it. However, HE incurs latency several orders of magnitude higher than conventional encryption schemes. The number theoretic transform (NTT), a polynomial multiplication algorithm, is the bottleneck function in HE. In traditional architectures, memory accesses and limited support for parallel operations constrain NTT's throughput and energy efficiency. Processing in memory (PIM) is a promising approach that can maximize parallelism with high energy efficiency. To enable HE on resource-constrained edge devices, this article presents MBSNTT, a digital in-memory multi-bit-serial NTT accelerator that achieves high parallelism and energy efficiency for NTT with minimal area. MBSNTT features a novel multi-bit-serial modular multiplication algorithm and a PIM implementation that computes all modular multiplications in an NTT in parallel. It further adopts a constant-geometry NTT data flow for efficient transition between NTT stages and different cores. Our evaluation shows that MBSNTT achieves $1.62\times$ ($19.08\times$) higher throughput and $64.9\times$ ($2.06\times$) lower energy than the state-of-the-art PIM NTT accelerator Crypto-PIM (MeNTT), at a polynomial order of 8K and a bit width of 128.
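The bottleneck kernel is easy to state in software. A textbook O(n²) NTT over Z_q, with toy parameters (q = 17, length-4 transforms, 4 as a primitive 4th root of unity mod 17; real HE uses far larger q and n), illustrates why pointwise products replace polynomial multiplication:

```python
def ntt(a, q=17, root=4):
    """Textbook O(n^2) number theoretic transform: a DFT over Z_q.

    q is prime and `root` is a primitive n-th root of unity mod q
    (4 has order 4 mod 17, so n = len(a) must be 4 with the defaults).
    Cyclic polynomial multiplication mod x^n - 1 becomes a pointwise
    product in this domain, which is why NTT dominates HE workloads.
    """
    n = len(a)
    return [sum(a[j] * pow(root, i * j, q) for j in range(n)) % q
            for i in range(n)]

def intt(A, q=17, root=4):
    """Inverse NTT: forward transform with root^-1, scaled by n^-1 mod q."""
    n = len(A)
    inv_root = pow(root, -1, q)   # modular inverse (Python 3.8+)
    inv_n = pow(n, -1, q)
    return [(v * inv_n) % q for v in ntt(A, q, inv_root)]
```

Fast implementations replace this O(n²) loop with an O(n log n) butterfly network; MBSNTT's contribution is computing the modular multiplications of that network in parallel, bit-serially, inside memory.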
Vol. 33, no. 2, pp. 537–545.
Citations: 0
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information
IF 2.8 | CAS Tier 2 (Engineering & Technology) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-26 | DOI: 10.1109/TVLSI.2024.3457193
Vol. 32, no. 10, p. C3.
Citations: 0
Hi-NeRF: A Multicore NeRF Accelerator With Hierarchical Empty Space Skipping for Edge 3-D Rendering
IF 2.8 | CAS Tier 2 (Engineering & Technology) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-24 | DOI: 10.1109/TVLSI.2024.3458032
Lizhou Wu;Haozhe Zhu;Jiapei Zheng;Mengjie Li;Yinuo Cheng;Qi Liu;Xiaoyang Zeng;Chixiao Chen
Neural radiance fields (NeRFs) have proved promising in augmented/virtual-reality applications. However, the deployment of NeRF on edge devices suffers from inadequate throughput due to redundant ray sampling and congested memory access. To address these challenges, this article proposes Hi-NeRF, a multi-rendering-core accelerator for efficient edge NeRF rendering. At the architecture level, a hierarchical empty space skipping (HESS) scheme is adopted, which efficiently locates the effective samples with fewer skipping steps and thus accelerates the ray-marching process. Furthermore, to alleviate the memory-access bottleneck, a vertex-interleaved mapping (VIM) method that eliminates memory bank conflicts is also proposed. At the hardware level, ineffective sample filters (ISFs) and voxel access filters (VCFs) are introduced to further exploit spatial sparsity and data locality at run-time. The experimental results show that our work achieves $2.67\times$ rendering throughput and $11.2\times$ energy efficiency compared to a state-of-the-art NeRF rendering accelerator, and energy efficiency can be improved by $561\times$ compared to a commercial GPU.
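The idea behind empty-space skipping can be shown with a toy single-level version (the paper's HESS is hierarchical and implemented in hardware; the function and parameters below are purely illustrative): probe the ray at a coarse stride and spend fine samples only where an occupancy structure says the space may be non-empty.

```python
import numpy as np

def march_with_skipping(origin, direction, occupancy, t_max=8.0,
                        coarse_step=1.0, fine_step=0.125):
    """Toy one-level empty-space skipping along a ray.

    occupancy(p) -> bool reports whether the region containing point p may
    be non-empty. The ray is probed at a coarse stride; only where a coarse
    probe hits occupied space do we fall back to dense fine sampling, which
    is the saving a hierarchical skip structure buys at larger scale.
    """
    origin = np.asarray(origin, float)
    direction = np.asarray(direction, float)
    samples = []
    t = 0.0
    while t < t_max:
        if occupancy(origin + t * direction):
            # occupied coarse cell: take dense samples across it
            f = t
            while f < min(t + coarse_step, t_max):
                samples.append(f)
                f += fine_step
        t += coarse_step
    return samples
```

With one occupied unit cell out of eight, the toy ray takes 8 fine samples instead of 64, the same kind of reduction the hierarchy delivers per ray.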
Vol. 32, no. 12, pp. 2315–2326.
Citations: 0
Robust Hardware Trojan Detection Method by Unsupervised Learning of Electromagnetic Signals
IF 2.8 | CAS Tier 2 (Engineering & Technology) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-24 | DOI: 10.1109/TVLSI.2024.3458892
Daehyeon Lee;Junghee Lee;Younggiu Jung;Janghyuk Kauh;Taigon Song
This article explores the threat posed by Hardware Trojans (HTs), malicious circuits clandestinely embedded in hardware, akin to software backdoors. Once activated by an attacker, these Trojans can induce malfunctions or leak confidential information by manipulating the hardware's normal operation. Even with robust software security, detecting malicious circuits and ensuring normal hardware operation is challenging. The issue is particularly acute in weapon systems, where an HT can present a significant threat, potentially leading to immediate disablement by an adversary. Given the severe risks associated with HTs, detection becomes imperative. The study demonstrates the efficacy of deep-learning-based HT detection by comparing and analyzing deep-learning methods against existing approaches, and proposes the deep support vector data description (Deep SVDD) model for HT detection. The proposed method outperforms existing methods when detecting untrained HTs, achieving 92.87% accuracy on average versus 50.00% for an existing method. This finding contributes valuable insights to the field of hardware security and lays the foundation for practical applications of Deep SVDD in real-world scenarios.
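The SVDD decision rule itself is simple: score each sample by its distance to a hypersphere center fitted on normal data only. A minimal NumPy sketch, omitting the deep feature extractor that Deep SVDD trains (function names are illustrative):

```python
import numpy as np

def svdd_center(features):
    """Center c of the SVDD hypersphere: the mean of normal-only embeddings."""
    return features.mean(axis=0)

def svdd_scores(features, center):
    """Anomaly score = squared distance to the center; larger = more anomalous.

    Deep SVDD trains a network so that normal samples map close to c; here
    the feature extractor is omitted and fixed vectors are scored, which is
    enough to show the one-class decision rule used for Trojan detection.
    """
    return ((features - center) ** 2).sum(axis=1)
```

In the article's setting, the features would be embeddings of electromagnetic traces from Trojan-free chips; a trace whose score exceeds the learned radius is flagged as a potential HT, with no Trojan examples needed at training time.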
Vol. 32, no. 12, pp. 2327–2340.
Citations: 0
A Fast Transient Response Distributed Power Supply With Dynamic Output Switching for Power Side-Channel Attack Mitigation
IF 2.8 | CAS Tier 2 (Engineering & Technology) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-24 | DOI: 10.1109/TVLSI.2024.3433429
Xingye Liu;Paul Ampadu
We present a distributed power supply and explore its load transient response and power side-channel security improvements. Typically, countermeasures against power side-channel attacks (PSCAs) are based on specialized dc/dc converters, which incur large power and area overheads and are difficult to scale. Moreover, due to limited output voltage range and load regulation, it is not feasible to directly distribute these converters in multicore applications. Targeting these issues, our proposed converter is designed to provide multiple fast-responding voltages and to use shared circuits to mitigate PSCAs. The proposed three-output dc/dc converter can deliver 0.33–0.92 V with up to 1 A to each load. Compared with state-of-the-art power-management works, our converter has $2\times$ the load-step response speed and $4\times$ the reference-voltage tracking speed. Furthermore, the converter requires $9\times$ less inductance and $3\times$ less output capacitance. In terms of PSCA mitigation, this converter reduces the correlation between the input power trace and the encryption load current by $107\times$, which is $3\times$ better than the best standalone work, while inducing only 1.7% area overhead and 2.5% power overhead. The proposed work also increases the minimum traces to disclose (MTDs) by $1250\times$. Considering all the above, our work is a strong candidate for future multicore systems that supply varying voltages and resist side-channel attacks. It is the first work bridging the gap between on-chip power management and side-channel security.
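The correlation metric quoted above comes from correlation power analysis, where an attacker correlates measured power traces against a hypothetical leakage model. A minimal sketch of that statistic (illustrative, not the paper's measurement setup):

```python
import numpy as np

def cpa_correlation(traces, leakage_model):
    """Pearson correlation between power traces and a leakage hypothesis.

    traces: (n_traces, n_samples) power measurements, one row per encryption.
    leakage_model: (n_traces,) predicted leakage per trace (e.g., Hamming
    weight of an intermediate value). A countermeasure such as a randomized
    supply aims to drive this correlation toward zero at every sample point.
    """
    t = traces - traces.mean(axis=0)          # center each sample column
    l = leakage_model - leakage_model.mean()  # center the hypothesis
    num = t.T @ l
    den = np.sqrt((t ** 2).sum(axis=0) * (l ** 2).sum())
    return num / den
```

The minimum-traces-to-disclose figure follows directly: the smaller the peak correlation, the more traces an attacker must collect before the correct key hypothesis stands out from noise.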
Vol. 33, no. 1, pp. 261–274.
Citations: 0
Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL
IF 2.8 CAS Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-09-24 DOI: 10.1109/TVLSI.2024.3458062
Topi Leppänen;Leevi Leppänen;Joonas Multanen;Pekka Jääskeläinen
Field-programmable gate array (FPGA) vendors provide high-level synthesis (HLS) compilers with accompanying OpenCL runtimes to make their devices easier to use for non-hardware experts. However, the runtimes currently provided by the vendors are not OpenCL-compliant, which limits application portability and makes it difficult to integrate FPGA devices into heterogeneous computing platforms. We propose an automated FPGA management tool, AFOCL, with the guiding principle that the software programmer should only need the standard OpenCL API to manage FPGA acceleration tasks. This improves portability, since the same OpenCL program will work on any OpenCL-compliant computation device able to execute the same kernels, including CPUs, GPUs, and FPGAs. The proposed approach is based on pre-optimized FPGA bitstreams implementing well-defined OpenCL built-in kernels. This enables a clean separation of responsibilities between a hardware developer preparing the FPGA bitstreams containing the kernel implementations, a software developer launching computation tasks as OpenCL built-in kernels, and a bitstream distributor providing pre-optimized FPGA IPs to end users. The automated FPGA programming tool fetches bitstream files as needed from the distributor, reconfigures the FPGA, and manages the communication with the accelerator. We demonstrate that it is possible to achieve performance similar to the current FPGA vendor OpenCL implementations while abstracting all FPGA-specific details from the software programmer.
Topi Leppänen; Leevi Leppänen; Joonas Multanen; Pekka Jääskeläinen, "Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 12, pp. 2257–2268. DOI: 10.1109/TVLSI.2024.3458062. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10689610
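The bitstream-database flow described in the abstract above can be sketched schematically. The class and method names below are illustrative inventions, not the actual AFOCL API; the kernel and bitstream names are hypothetical, and real hardware steps (FPGA reconfiguration, kernel dispatch on the fabric) are reduced to placeholders.

```python
# Schematic sketch of a bitstream-database-driven flow like the one AFOCL
# automates: the runtime resolves an OpenCL built-in kernel name to a
# pre-optimized bitstream for the attached device, reconfigures the FPGA
# if needed, and dispatches the kernel. The in-memory "database" stands in
# for the bitstream distributor; all names are illustrative.
class BitstreamDatabase:
    """Maps (device, built-in kernel name) to a pre-built bitstream file."""
    def __init__(self, entries):
        self._entries = dict(entries)

    def fetch(self, device, kernel_name):
        try:
            return self._entries[(device, kernel_name)]
        except KeyError:
            raise LookupError(f"no bitstream for {kernel_name!r} on {device!r}")

class FpgaRuntime:
    """Stands in for the vendor reconfiguration/dispatch machinery."""
    def __init__(self, device, database):
        self.device = device
        self.database = database
        self.loaded = None  # currently configured bitstream, if any

    def enqueue_builtin_kernel(self, kernel_name, args):
        bitstream = self.database.fetch(self.device, kernel_name)
        if bitstream != self.loaded:   # reconfigure only on a miss
            self.loaded = bitstream    # placeholder for the actual reconfiguration
        # Placeholder for the actual kernel launch on the fabric:
        return f"ran {kernel_name} from {bitstream} with {len(args)} args"

# The same host logic works across vendors; only the database entries differ.
db = BitstreamDatabase({
    ("amd-pcie", "example.add.i32"): "add_i32_amd.xclbin",
    ("altera-soc", "example.add.i32"): "add_i32_altera.rbf",
})
rt = FpgaRuntime("amd-pcie", db)
print(rt.enqueue_builtin_kernel("example.add.i32", [1, 2, 3]))
```

The point of the dictionary keyed by (device, kernel name) is the paper's separation of responsibilities: the host program only names a built-in kernel, while the mapping to a device-specific bitstream lives entirely on the distributor side.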
Citations: 0
Copyright © 2023 Book学术 All rights reserved.
京公网安备 11010802042870号 京ICP备2023020795号-1