
IEEE Transactions on Computers: Latest Articles

Exploiting Structured Feature and Runtime Isolation for High-Performant Recommendation Serving
IF 3.6 Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-28 DOI: 10.1109/TC.2024.3449749
Xin You;Hailong Yang;Siqi Wang;Tao Peng;Chen Ding;Xinyuan Li;Bangduo Chen;Zhongzhi Luan;Tongxuan Liu;Yong Li;Depei Qian
Recommendation serving with deep learning models is one of the most valuable services of modern E-commerce companies. In production, to accommodate billions of recommendation queries with stringent service level agreements, high-performant recommendation serving systems play an essential role in meeting such daunting demand. Unfortunately, existing model serving frameworks fail to achieve efficient serving due to unique challenges such as 1) the input format mismatch between service needs and the model's ability and 2) heavy software contentions to concurrently execute the constrained operations. To address the above challenges, we propose RecServe, a high-performant serving system for recommendation with the optimized design of structured features and SessionGroups for recommendation serving. With structured features, RecServe packs single-user-multiple-candidates inputs by semi-automatically transforming computation graphs with annotated input tensors, which can significantly reduce redundant network transmission, data movements, and useless computations. With session group, RecServe further adopts resource isolations for multiple compute streams and cost-aware operator scheduler with critical-path-based schedule policy to enable concurrent kernel execution, further improving serving throughput. The experiment results demonstrate that RecServe can achieve maximum performance speedups of 12.3$\boldsymbol{\times}$ and $22.0\boldsymbol{\times}$ compared to the state-of-the-art serving system on CPU and GPU platforms, respectively.
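As a rough illustration of why packing single-user-multiple-candidates inputs can cut redundant transmission, the sketch below counts the feature values sent when the user features are copied into every candidate row versus sent once and shared. The function names and the feature widths are hypothetical and are not taken from RecServe.

```python
def replicated_values(user_dim: int, item_dim: int, num_candidates: int) -> int:
    """Values sent when the user features are copied into every candidate row."""
    return num_candidates * (user_dim + item_dim)


def packed_values(user_dim: int, item_dim: int, num_candidates: int) -> int:
    """Values sent when the user features are sent once and shared by all candidates."""
    return user_dim + num_candidates * item_dim


if __name__ == "__main__":
    # Hypothetical feature widths and candidate count, chosen only for illustration.
    user_dim, item_dim, num_candidates = 200, 50, 100
    print(replicated_values(user_dim, item_dim, num_candidates))
    print(packed_values(user_dim, item_dim, num_candidates))
```

The saving grows with the number of candidates per query and with the share of features that belong to the user rather than to the candidate.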
IEEE Transactions on Computers, vol. 73, no. 11, pp. 2474-2487.
Citations: 0
Falic: An FPGA-Based Multi-Scalar Multiplication Accelerator for Zero-Knowledge Proof
IF 3.6 Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-23 DOI: 10.1109/TC.2024.3449121
Yongkui Yang;Zhenyan Lu;Jingwei Zeng;Xingguo Liu;Xuehai Qian;Zhibin Yu
In this paper, we propose Falic, a novel FPGA-based accelerator for multi-scalar multiplication (MSM), the most time-consuming phase of zk-SNARK proof generation. Falic introduces three techniques. First, it leverages a globally asynchronous locally synchronous (GALS) strategy to build multiple small and lightweight MSM cores that parallelize the independent inner product computation on different portions of the scalar vector and point vector. Second, each MSM core contains just one large-integer modular multiplier (LIMM) that is multiplexed to perform the point additions (PADDs) generated during MSM. We strike a balance between throughput and hardware cost by batching the appropriate number of PADDs and selecting the computation graph of PADD with a proper degree of parallelism. Finally, the performance is further improved by a simple cache structure that enables computation reuse. We implement Falic on two different FPGAs with different hardware resources, i.e., the Xilinx U200 and Xilinx U250. Compared to the prior FPGA-based accelerator, Falic improves the MSM throughput by $3.9\boldsymbol{\times}$. Experimental results also show that Falic achieves a throughput speedup of up to $1.62\boldsymbol{\times}$ and saves as much as $8.5\boldsymbol{\times}$ energy compared to an RTX 2080Ti GPU.
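The partitioning idea in the first technique can be sketched as follows: an MSM is a sum of scalar-point products, so the scalar and point vectors can be split into chunks whose partial sums are computed independently and then added. This is only a minimal sketch; it uses integers modulo a prime as a stand-in for elliptic-curve points (an assumption made to keep the example self-contained), and the names `msm` and `msm_partitioned` are hypothetical, not Falic's.

```python
from typing import Sequence

# Toy group: integers modulo a prime under addition. This is NOT an elliptic
# curve; it only stands in for one so the partitioning logic can be run.
MODULUS = 1_000_003


def msm(scalars: Sequence[int], points: Sequence[int]) -> int:
    """Multi-scalar multiplication sum(k_i * P_i) in the toy group."""
    acc = 0
    for k, p in zip(scalars, points):
        acc = (acc + k * p) % MODULUS
    return acc


def msm_partitioned(scalars: Sequence[int], points: Sequence[int], num_cores: int) -> int:
    """Split both vectors into chunks, compute each partial MSM independently, then add."""
    n = len(scalars)
    chunk = max(1, (n + num_cores - 1) // num_cores)  # ceiling division, never zero
    partials = [
        msm(scalars[start:start + chunk], points[start:start + chunk])
        for start in range(0, n, chunk)
    ]
    return sum(partials) % MODULUS
```

Because the partial sums share no state, each chunk could be handed to a separate core; only the final addition needs the results to be brought together.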
IEEE Transactions on Computers, vol. 73, no. 12, pp. 2791-2804.
Citations: 0
HGNAS: Hardware-Aware Graph Neural Architecture Search for Edge Devices
IF 3.6 Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-23 DOI: 10.1109/TC.2024.3449108
Ao Zhou;Jianlei Yang;Yingjie Qi;Tong Qiao;Yumeng Shi;Cenlin Duan;Weisheng Zhao;Chunming Hu
Graph Neural Networks (GNNs) are becoming increasingly popular for graph-based learning tasks such as point cloud processing due to their state-of-the-art (SOTA) performance. Nevertheless, the research community has primarily focused on improving model expressiveness, lacking consideration of how to design efficient GNN models for edge scenarios with real-time requirements and limited resources. Examining existing GNN models reveals varied execution across platforms and frequent Out-Of-Memory (OOM) problems, highlighting the need for hardware-aware GNN design. To address this challenge, this work proposes a novel hardware-aware graph neural architecture search framework tailored for resource-constrained edge devices, namely HGNAS. To achieve hardware awareness, HGNAS integrates an efficient GNN hardware performance predictor that evaluates the latency and peak memory usage of GNNs in milliseconds. Meanwhile, we study GNN memory usage during inference and offer a peak memory estimation method, enhancing the robustness of architecture evaluations when combined with predictor outcomes. Furthermore, HGNAS constructs a fine-grained design space to enable the exploration of extreme performance architectures by decoupling the GNN paradigm. In addition, the multi-stage hierarchical search strategy is leveraged to facilitate the navigation of huge candidates, which can reduce the single search time to a few GPU hours. To the best of our knowledge, HGNAS is the first automated GNN design framework for edge devices, and also the first work to achieve hardware awareness of GNNs across different platforms. Extensive experiments across various applications and edge devices have proven the superiority of HGNAS. It can achieve up to a $10.6\boldsymbol{\times}$ speedup and an $82.5\%$ peak memory reduction with negligible accuracy loss compared to DGCNN on ModelNet40.
IEEE Transactions on Computers, vol. 73, no. 12, pp. 2693-2707.
Citations: 0
Enabling Efficient Deep Learning on MCU With Transient Redundancy Elimination
IF 3.6 Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-23 DOI: 10.1109/TC.2024.3449102
Jiesong Liu;Feng Zhang;Jiawei Guan;Hsin-Hsuan Sung;Xiaoguang Guo;Saiqin Long;Xiaoyong Du;Xipeng Shen
Deploying deep neural networks (DNNs) with satisfactory performance in resource-constrained environments is challenging. This is especially true of microcontrollers due to their tight space and computational capabilities. However, there is a growing demand for DNNs on microcontrollers, as executing large DNNs on microcontrollers is critical to reducing energy consumption, increasing performance efficiency, and eliminating privacy concerns. This paper presents a novel and systematic data redundancy elimination method to implement efficient DNNs on microcontrollers through innovations in computation and space optimization. By making the optimization itself a trainable component in the target neural networks, this method maximizes performance benefits while keeping the DNN accuracy stable. Experiments are performed on two microcontroller boards with three popular DNNs, namely CifarNet, ZfNet and SqueezeNet. Experiments show that this solution eliminates more than 96% of computations in DNNs and makes them fit well on microcontrollers, yielding 3.4-5$\times$ speedup with little loss of accuracy.
IEEE Transactions on Computers, vol. 73, no. 12, pp. 2649-2663.
Citations: 0
BiRD: Bi-Directional Input Reuse Dataflow for Enhancing Depthwise Convolution Performance on Systolic Arrays
IF 3.6 Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-23 DOI: 10.1109/TC.2024.3449103
Mingeon Park;Seokjin Hwang;Hyungmin Cho
Depthwise convolution (DWConv) is an effective technique for reducing the size and computational requirements of convolutional neural networks. However, DWConv's input reuse pattern is not easily transformed into dense matrix multiplications, leading to low utilization of processing elements (PEs) on existing systolic arrays. In this paper, we introduce a novel systolic array dataflow mechanism called BiRD, designed to maximize input reuse and boost DWConv performance. BiRD utilizes two directions of input reuse and necessitates only minor modifications to a typical weight-stationary type systolic array. We evaluate BiRD on the Gemmini platform, comparing it with existing dataflow types. The results demonstrate that BiRD achieves significant performance improvements in computation time reduction, while incurring minimal area overhead and improved energy consumption compared to other dataflow types. For example, on a 32$\times$32 systolic array, it results in a 9.8% area overhead, significantly smaller than other dataflow types for DWConv. Compared to matrix multiplication-based DWConv, BiRD achieves a 4.7$\times$ performance improvement for DWConv layers of MobileNet-V2, resulting in a 55.8% reduction in total inference computation time and a 44.9% reduction in energy consumption. Our results highlight the effectiveness of BiRD in enhancing the performance of DWConv on systolic arrays.
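For readers unfamiliar with the operation, here is a plain reference implementation of depthwise convolution (stride 1, no padding). It is a generic sketch, not BiRD's dataflow, and the function name is hypothetical. It shows that each channel is filtered only by its own kernel, with no summation across channels, which is the property that makes the operation awkward to express as one dense matrix multiplication.

```python
from typing import List

Matrix = List[List[float]]


def depthwise_conv2d(inputs: List[Matrix], kernels: List[Matrix]) -> List[Matrix]:
    """Depthwise convolution (stride 1, no padding): channel c uses only kernels[c]."""
    assert len(inputs) == len(kernels), "one kernel per input channel"
    outputs = []
    for x, k in zip(inputs, kernels):
        kh, kw = len(k), len(k[0])
        out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
        out = [
            [
                # Multiply-accumulate over the kernel window of this channel only.
                sum(x[r + i][c + j] * k[i][j] for i in range(kh) for j in range(kw))
                for c in range(out_w)
            ]
            for r in range(out_h)
        ]
        outputs.append(out)
    return outputs
```

Neighbouring output positions read overlapping input windows, so the same input value is needed several times along both the row and the column direction of the feature map.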
IEEE Transactions on Computers, vol. 73, no. 12, pp. 2708-2721.
Citations: 0
Joint Pruning and Channel-Wise Mixed-Precision Quantization for Efficient Deep Neural Networks
IF 3.6 Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-23 DOI: 10.1109/TC.2024.3449084
Beatrice Alessandra Motetti;Matteo Risso;Alessio Burrello;Enrico Macii;Massimo Poncino;Daniele Jahier Pagliari
The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization, which lead to latency and memory occupation improvements. These optimization techniques are usually applied independently. We propose a novel methodology to apply them jointly via a lightweight gradient-based search, and in a hardware-aware manner, greatly reducing the time required to generate Pareto-optimal DNNs in terms of accuracy versus cost (i.e., latency or memory). We test our approach on three edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny ImageNet. When targeting the optimization of the memory footprint, we are able to achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the baseline networks with all weights quantized at 8 and 2-bit, respectively. Our method surpasses a previous state-of-the-art approach with up to 56.17% size reduction at iso-accuracy. With respect to the sequential application of state-of-the-art pruning and mixed-precision optimizations, we obtain comparable or superior results, but with a significantly lowered training time. In addition, we show how well-tailored cost models can improve the cost versus accuracy trade-offs when targeting specific hardware for deployment.
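To make the memory-footprint metric concrete, the sketch below computes weight storage for a layer whose output channels each carry their own bit-width, and the resulting size reduction against a uniform baseline. Modelling a pruned channel as a bit-width of 0 is an assumption made here for illustration; the abstract does not state how the authors encode pruning. All names and numbers are hypothetical.

```python
from typing import Sequence


def weight_bits(params_per_channel: Sequence[int], bitwidths: Sequence[int]) -> int:
    """Total weight storage in bits; a bit-width of 0 models a pruned channel (assumption)."""
    assert len(params_per_channel) == len(bitwidths)
    return sum(p * b for p, b in zip(params_per_channel, bitwidths))


def size_reduction(params_per_channel: Sequence[int], bitwidths: Sequence[int],
                   baseline_bits: int = 8) -> float:
    """Fractional size reduction relative to quantizing every channel at baseline_bits."""
    baseline = weight_bits(params_per_channel, [baseline_bits] * len(params_per_channel))
    return 1.0 - weight_bits(params_per_channel, bitwidths) / baseline


if __name__ == "__main__":
    # Four output channels of a 3x3 convolution with 3 input channels (27 weights each).
    print(size_reduction([27, 27, 27, 27], [8, 4, 2, 0]))
```

A differentiable version of such a cost term is what a gradient-based search would need in order to trade accuracy against size, although the exact formulation used in the paper is not given in the abstract.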
IEEE Transactions on Computers, vol. 73, no. 11, pp. 2619-2633.
Citations: 0
Optimized Quantum Circuit of AES With Interlacing-Uncompute Structure
IF 3.6 Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-23 DOI: 10.1109/TC.2024.3449094
Mengyuan Zhang;Tairong Shi;Wenling Wu;Han Sui
In the post-quantum era, the security level of encryption algorithms is often evaluated based on the quantum resources required to attack AES. In this work, we thoroughly estimate various performance metrics of the quantum circuit of AES-128/192/256. Firstly, we introduce a generic round structure for in-place implementation of the AES algorithm, maximizing the parallelism between nonlinear components. Specifically, when employed as an encryption oracle, our structure reduces the $T$-depth from $2rd$ to $(r+1)d$. Furthermore, by leveraging the properties of block-cyclic matrices, we present an in-place implementation circuit for MixColumn with depth 10, utilizing 105 CNOT gates. In relation to the S-box, we have assessed its minimum circuit width at different $T$-depths and provide multiple versions of circuit implementations for a depth-width trade-off. Finally, based on our optimized S-box circuit, we conduct a comprehensive analysis of the implementation complexity of different round structures, where our structure exhibits significant advantages in terms of low $T$-depth.
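The two $T$-depth expressions can be compared numerically. The sketch below assumes that $r$ is the number of AES rounds and $d$ is a per-round $T$-depth; the abstract does not define either symbol, so both readings are assumptions, and the value of `d` used here is purely illustrative.

```python
AES_ROUNDS = {128: 10, 192: 12, 256: 14}  # round counts from the AES standard


def t_depth_baseline(rounds: int, d: int) -> int:
    """T-depth 2*r*d quoted in the abstract for the structure being improved on."""
    return 2 * rounds * d


def t_depth_interlaced(rounds: int, d: int) -> int:
    """T-depth (r+1)*d quoted in the abstract for the proposed round structure."""
    return (rounds + 1) * d


if __name__ == "__main__":
    d = 4  # hypothetical per-round T-depth, chosen only for illustration
    for key_bits, r in AES_ROUNDS.items():
        print(key_bits, t_depth_baseline(r, d), t_depth_interlaced(r, d))
```

Under this reading the ratio $(r+1)/(2r)$ does not depend on $d$ and approaches one half as the number of rounds grows.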
IEEE Transactions on Computers, vol. 73, no. 11, pp. 2563-2575.
引用次数: 0
SCARF: Securing Chips With a Robust Framework Against Fabrication-Time Hardware Trojans
IF 3.6 Zone 2 Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-23 DOI: 10.1109/TC.2024.3449082
Mohammad Eslami;Tara Ghasempouri;Samuel Pagliarini
The globalization of the semiconductor industry has introduced security challenges to Integrated Circuits (ICs), particularly those related to the threat of Hardware Trojans (HTs) – malicious logic that can be introduced during IC fabrication. While significant efforts are directed towards verifying the correctness and reliability of ICs, their security is often overlooked. In this paper, we propose a comprehensive framework that integrates a suite of methodologies for both front-end and back-end stages of design, aimed at enhancing the security of ICs. Initially, we outline a systematic methodology to transform existing verification assets into potent security checkers by repurposing verification assertions. To further improve security, we introduce an innovative methodology for integrating online monitors during physical synthesis – a back-end insertion providing an additional layer of defense. Experimental results demonstrate a significant increase in security, measured by our introduced metric, Security Coverage (SC), with a marginal rise in area and power consumption, typically under 20%. The insertion of online monitors during physical synthesis enhances security metrics by up to 33.5%. This holistic framework offers a comprehensive defense mechanism across the entire spectrum of IC design.
IEEE Transactions on Computers, vol. 73, no. 12, pp. 2761-2775.
Citations: 0
Single-Key Attack on Full-Round Shadow Designed for IoT Nodes
IF 3.6 Zone 2 Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-23 DOI: 10.1109/TC.2024.3449040
Yuhan Zhang;Wenling Wu;Lei Zhang;Yafei Zheng
With the rapid advancement of the Internet of Things (IoT), many innovative lightweight block ciphers have been introduced to meet the stringent security demands of IoT devices. Among these, the Shadow cipher stands out for its compactness, making it particularly well-suited for deployment in resource-constrained IoT nodes (IEEE Internet of Things Journal, 2021). This paper demonstrates two real-time attacks on Shadow for the first time: real-time plaintext recovery and key recovery. Firstly, numerous properties of Shadow are discussed, illustrating an equivalent representation of the two-round Shadow and the relationship between the round keys. Secondly, we introduce multiple two-round iterative linear approximations. Employing these approximations enables the derivation of full-round linear distinguishers. Moreover, we have uncovered numerous linear relationships between plaintext and ciphertext. Real-time plaintext recovery is achievable based on these established relationships. On average, it takes 5 seconds to recover the plaintext for a fixed ciphertext of Shadow-32. Thirdly, many properties of the propagation of difference through SIMON-like function are illustrated. According to these properties, various differential distinguishers up to full rounds are presented, allowing real-time key recovery. Specifically, the 64-bit master key of Shadow-32 can be retrieved in around two days on average. Experiments verify all our results.
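The abstract does not spell out which difference-propagation properties the authors use. As an illustration only, one well-known fact about SIMON's round function f(x) = ((x <<< 1) & (x <<< 8)) ^ (x <<< 2) is that f is quadratic, so for a fixed input difference the possible output differences form an affine subspace and each occurs equally often. The sketch below checks this exhaustively for one input difference. It uses SIMON's 16-bit word and rotation amounts as a stand-in; Shadow's own word size and rotation constants are not given in this listing and may differ, and all function names are hypothetical.

```python
from collections import Counter

WORD_BITS = 16                      # SIMON32 word size (assumption, not Shadow's)
MASK = (1 << WORD_BITS) - 1


def rotl(x: int, r: int) -> int:
    """Rotate a WORD_BITS-bit value left by r positions."""
    r %= WORD_BITS
    return ((x << r) | (x >> (WORD_BITS - r))) & MASK


def simon_f(x: int) -> int:
    """SIMON round function: (x<<<1 & x<<<8) xor (x<<<2)."""
    return (rotl(x, 1) & rotl(x, 8)) ^ rotl(x, 2)


def output_differences(alpha: int) -> Counter:
    """Count f(x) ^ f(x ^ alpha) over every 16-bit input x."""
    return Counter(simon_f(x) ^ simon_f(x ^ alpha) for x in range(1 << WORD_BITS))


if __name__ == "__main__":
    diffs = output_differences(0x0001)
    for beta, count in sorted(diffs.items()):
        print(f"{beta:#06x}: {count}/{1 << WORD_BITS}")
```

For the single-bit input difference 0x0001, the linear part (x <<< 2) contributes a fixed bit, while the AND contributes two free bits, which is why only a handful of equally likely output differences appear.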
IEEE Transactions on Computers, vol. 73, no. 12, pp. 2776-2790.
Citations: 0
ROLoad-PMP: Securing Sensitive Operations for Kernels and Bare-Metal Firmware
IF 3.6 Zone 2 Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-23 DOI: 10.1109/TC.2024.3449105
Wende Tan;Chenyang Li;Yangyu Chen;Yuan Li;Chao Zhang;Jianping Wu
A common way for attackers to compromise victim systems is hijacking sensitive operations (e.g., control-flow transfers) with attacker-controlled inputs. Existing solutions in general only protect parts of these targets and have high performance overheads, which are impractical and hard to deploy on systems with limited resources (e.g., IoT devices) or for low-level software like kernels and bare-metal firmware. In this paper, we present a lightweight hardware-software co-design solution ROLoad-PMP to protect sensitive operations from being hijacked for low-level software. First, we propose new instructions, which only load data from read-only memory regions with specific keys, to guarantee the integrity of pointees pointed to by (potentially corrupted) data pointers. Then, we provide a program hardening mechanism to protect sensitive operations, by classifying and placing their operands into read-only memory with different keys at compile-time and loading them with ROLoad-PMP-family instructions at runtime. We have implemented an FPGA-based prototype of ROLoad-PMP based on RISC-V, and demonstrated an important defense application, i.e., forward-edge control-flow integrity. Results showed that ROLoad-PMP costs only a few extra hardware resources ($<1.40\%$). Moreover, it enables many lightweight (e.g., with negligible overheads $<0.853\%$) defenses, and provides broader and stronger security guarantees than existing hardware solutions, e.g., ARM BTI and Intel CET.
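The keyed read-only load idea described in the abstract can be modeled in software to make the check concrete. The sketch below is a toy simulation, not the paper's instruction semantics or RISC-V encoding: the `Region`, `Memory`, `roload`, and `LoadFault` names, the key values, and the addresses are all hypothetical. It shows a function-pointer table placed in a read-only region tagged with a key, and a load that faults if a (possibly corrupted) pointer leads to writable memory or to a region with a different key.

```python
from dataclasses import dataclass


class LoadFault(Exception):
    """Raised when the modeled memory rejects an access."""


@dataclass(frozen=True)
class Region:
    start: int
    size: int
    writable: bool
    key: int  # hypothetical per-region key tag

    def contains(self, addr: int) -> bool:
        return self.start <= addr < self.start + self.size


class Memory:
    def __init__(self, regions):
        self.regions = list(regions)
        self.cells = {}

    def _region(self, addr: int) -> Region:
        for region in self.regions:
            if region.contains(addr):
                return region
        raise LoadFault(f"unmapped address {addr:#x}")

    def init(self, addr: int, value: int) -> None:
        """Place data at image-load time, before protections apply."""
        self._region(addr)
        self.cells[addr] = value

    def store(self, addr: int, value: int) -> None:
        """Ordinary store: only writable regions accept it."""
        if not self._region(addr).writable:
            raise LoadFault(f"store to read-only address {addr:#x}")
        self.cells[addr] = value

    def roload(self, addr: int, key: int) -> int:
        """Keyed load: succeeds only from a read-only region with this key."""
        region = self._region(addr)
        if region.writable or region.key != key:
            raise LoadFault(f"roload rejected at {addr:#x}")
        return self.cells.get(addr, 0)


VTABLE_KEY, OTHER_KEY = 7, 9          # hypothetical key values
mem = Memory([
    Region(0x1000, 0x100, writable=False, key=VTABLE_KEY),  # pointer table
    Region(0x2000, 0x100, writable=True, key=0),            # attacker-writable
    Region(0x3000, 0x100, writable=False, key=OTHER_KEY),   # other constants
])
mem.init(0x1000, 0x8000)   # legitimate call target
mem.store(0x2000, 0xBAD0)  # attacker plants a fake target in writable memory
mem.init(0x3000, 0x9000)   # read-only data of a different class

if __name__ == "__main__":
    print(hex(mem.roload(0x1000, VTABLE_KEY)))
    for bad in (0x2000, 0x3000):
        try:
            mem.roload(bad, VTABLE_KEY)
        except LoadFault as fault:
            print("fault:", fault)
```

In this model, corrupting the data pointer is not enough to redirect control flow: the value that reaches the indirect call must come from read-only memory carrying the expected key.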
IEEE Transactions on Computers, vol. 73, no. 12, pp. 2722-2733.
Citations: 0