IEEE Transactions on Computers最新文献_第8页

ROLoad-PMP: Securing Sensitive Operations for Kernels and Bare-Metal Firmware ROLoad-PMP：确保内核和裸机固件敏感操作的安全

IF 3.6 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-23 DOI: 10.1109/TC.2024.3449105

Wende Tan;Chenyang Li;Yangyu Chen;Yuan Li;Chao Zhang;Jianping Wu

A common way for attackers to compromise victim systems is hijacking sensitive operations (e.g., control-flow transfers) with attacker-controlled inputs. Existing solutions in general only protect parts of these targets and have high performance overheads, which are impractical and hard to deploy on systems with limited resources (e.g., IoT devices) or for low-level software like kernels and bare-metal firmware. In this paper, we present a lightweight hardware-software co-design solution ROLoad-PMP to protect sensitive operations from being hijacked for low-level software. First, we propose new instructions, which only load data from read-only memory regions with specific keys, to guarantee the integrity of pointees pointed by (potentially corrupted) data pointers. Then, we provide a program hardening mechanism to protect sensitive operations, by classifying and placing their operands into read-only memory with different keys at compile-time and loading them with ROLoad-PMP-family instructions at runtime. We have implemented an FPGA-based prototype of ROLoad-PMP based on RISC-V, and demonstrated an important defense application, i.e., forward-edge control-flow integrity. Results showed that ROLoad-PMP only costs few extra hardware resources (

$lt 1.40%$

). Moreover, it enables many lightweight (e.g., with negligible overheads

$lt 0.853%$

) defenses, and provides broader and stronger security guarantees than existing hardware solutions, e.g., ARM BTI and Intel CET.

攻击者入侵受害系统的一种常见方式是利用攻击者控制的输入劫持敏感操作（如控制流传输）。现有的解决方案一般只能保护这些目标的一部分，而且性能开销很高，在资源有限的系统（如物联网设备）或内核和裸机固件等低级软件上部署不切实际，也很困难。在本文中，我们提出了一种轻量级软硬件协同设计解决方案 ROLoad-PMP，以保护低级软件的敏感操作不被劫持。首先，我们提出了新的指令，这些指令只从具有特定密钥的只读内存区域加载数据，以保证由（可能损坏的）数据指针指向的点的完整性。然后，我们提供了一种程序加固机制，通过在编译时将操作数分类并放入具有不同密钥的只读存储器，并在运行时使用 ROLoad-PMP 系列指令加载操作数，来保护敏感操作。我们基于 RISC-V 实现了基于 FPGA 的 ROLoad-PMP 原型，并演示了一个重要的防御应用，即前沿控制流完整性。结果表明，ROLoad-PMP只需花费很少的额外硬件资源（1.40美元）。此外，它还实现了许多轻量级（例如，开销可忽略不计）的防御，并提供了比现有硬件解决方案（如ARM BTI和英特尔CET）更广泛、更强大的安全保证。

{"title":"ROLoad-PMP: Securing Sensitive Operations for Kernels and Bare-Metal Firmware","authors":"Wende Tan;Chenyang Li;Yangyu Chen;Yuan Li;Chao Zhang;Jianping Wu","doi":"10.1109/TC.2024.3449105","DOIUrl":"10.1109/TC.2024.3449105","url":null,"abstract":"A common way for attackers to compromise victim systems is hijacking sensitive operations (e.g., control-flow transfers) with attacker-controlled inputs. Existing solutions in general only protect parts of these targets and have high performance overheads, which are impractical and hard to deploy on systems with limited resources (e.g., IoT devices) or for low-level software like kernels and bare-metal firmware. In this paper, we present a lightweight hardware-software co-design solution ROLoad-PMP to protect sensitive operations from being hijacked for low-level software. First, we propose new instructions, which only load data from read-only memory regions with specific keys, to guarantee the integrity of pointees pointed by (potentially corrupted) data pointers. Then, we provide a program hardening mechanism to protect sensitive operations, by classifying and placing their operands into read-only memory with different keys at compile-time and loading them with ROLoad-PMP-family instructions at runtime. We have implemented an FPGA-based prototype of ROLoad-PMP based on RISC-V, and demonstrated an important defense application, i.e., forward-edge control-flow integrity. Results showed that ROLoad-PMP only costs few extra hardware resources (\u0000<inline-formula><tex-math>$lt 1.40%$</tex-math></inline-formula>\u0000). Moreover, it enables many lightweight (e.g., with negligible overheads \u0000<inline-formula><tex-math>$lt 0.853%$</tex-math></inline-formula>\u0000) defenses, and provides broader and stronger security guarantees than existing hardware solutions, e.g., ARM BTI and Intel CET.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2722-2733"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Sifter: An Efficient Operator Auto-Tuner with Speculative Design Space Exploration for Deep Learning Compiler Sifter：针对深度学习编译器的高效运算器自动调整器与推测性设计空间探索

IF 3.7 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-22 DOI: 10.1109/tc.2024.3441820

Qianhe Zhao, Rui Wang, Yi Liu, Hailong Yang, Zhongzhi Luan, Depei Qian

引用次数: 0

Deep Learning Acceleration Optimization of Stress Boundary Value Problem Solvers 深度学习加速优化应力边界值问题求解器

IF 3.6 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-12 DOI: 10.1109/TC.2024.3441828

Yongsheng Chen;Zhuowei Wang;Xiaoyu Song;Zhe Yan;Lianglun Cheng

The solution to boundary value problems is of great significance in industrial software applications. In this paper, we propose a novel deep learning method for simulating stress field distributions in simply supported beams, aiming to serve as a solver for stress boundary value problems. Our regression network, Stress-EA, utilizes the convolution encoder module and additive attention to accurately estimate the stress in the beam. By comparing the Stress-EA prediction results with the stress values calculated using ABAQUS, we achieve a mean absolute error (MAE) of less than 0.06. This indicates a high level of consistency between the stress values obtained from the two approaches. Moreover, the prediction time of Stress-EA is significantly shorter, taking only 0.0011s, compared to the calculation time of ABAQUS, which is 16.91s. This demonstrates the high accuracy and low computational latency of our model. Furthermore, our model exhibits smaller model parameters, requires less computation, and has a shorter prediction time compared to training results obtained using classic and advanced networks. To accelerate training, we utilize data parallel methods, achieving up to 1.89 speedup on a dual-GPU platform without compromising accuracy. This advancement enhances the computing efficiency for large-scale industrial software applications.

边界值问题的求解在工业软件应用中具有重要意义。在本文中，我们提出了一种模拟简支梁应力场分布的新型深度学习方法，旨在作为应力边界值问题的求解器。我们的回归网络 Stress-EA 利用卷积编码器模块和加法注意来准确估计梁中的应力。通过比较 Stress-EA 预测结果和使用 ABAQUS 计算的应力值，我们发现平均绝对误差 (MAE) 小于 0.06。这表明两种方法得出的应力值高度一致。此外，与 ABAQUS 的 16.91s 计算时间相比，Stress-EA 的预测时间大大缩短，仅为 0.0011s。这表明我们的模型具有高精度和低计算延迟的特点。此外，与使用经典网络和高级网络获得的训练结果相比，我们的模型显示出更小的模型参数、更少的计算量和更短的预测时间。为了加快训练速度，我们采用了数据并行方法，在双 GPU 平台上实现了高达 1.89 的速度提升，同时不影响准确性。这一进步提高了大规模工业软件应用的计算效率。

{"title":"Deep Learning Acceleration Optimization of Stress Boundary Value Problem Solvers","authors":"Yongsheng Chen;Zhuowei Wang;Xiaoyu Song;Zhe Yan;Lianglun Cheng","doi":"10.1109/TC.2024.3441828","DOIUrl":"10.1109/TC.2024.3441828","url":null,"abstract":"The solution to boundary value problems is of great significance in industrial software applications. In this paper, we propose a novel deep learning method for simulating stress field distributions in simply supported beams, aiming to serve as a solver for stress boundary value problems. Our regression network, Stress-EA, utilizes the convolution encoder module and additive attention to accurately estimate the stress in the beam. By comparing the Stress-EA prediction results with the stress values calculated using ABAQUS, we achieve a mean absolute error (MAE) of less than 0.06. This indicates a high level of consistency between the stress values obtained from the two approaches. Moreover, the prediction time of Stress-EA is significantly shorter, taking only 0.0011s, compared to the calculation time of ABAQUS, which is 16.91s. This demonstrates the high accuracy and low computational latency of our model. Furthermore, our model exhibits smaller model parameters, requires less computation, and has a shorter prediction time compared to training results obtained using classic and advanced networks. To accelerate training, we utilize data parallel methods, achieving up to 1.89 speedup on a dual-GPU platform without compromising accuracy. This advancement enhances the computing efficiency for large-scale industrial software applications.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2844-2854"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TOP: Towards Open & Predictable Heterogeneous SoCs TOP：迈向开放和可预测的异构 SoC

IF 3.6 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-12 DOI: 10.1109/TC.2024.3441849

Luca Valente;Francesco Restuccia;Davide Rossi;Ryan Kastner;Luca Benini

Ensuring predictability in modern real-time Systems-on-Chip (SoCs) is an increasingly critical concern for many application domains such as automotive, robotics, and industrial automation. An effective approach involves the modeling and development of hardware components, such as interconnects and shared memory resources, to evaluate or enforce their deterministic behavior. Unfortunately, these IPs are often closed-source, and these studies are limited to the single modules that must later be integrated with third-party IPs in more complex SoCs, hindering the precision and scope of modeling and compromising the overall predictability. With the coming-of-age of open-source instruction set architectures (RISC-V) and hardware, major opportunities for changing this status quo are emerging. This study introduces an innovative methodology for modeling and analyzing State-of-the-Art (SoA) open-source SoCs for low-power cyber-physical systems. Our approach models and analyzes the entire set of open-source IPs within these SoCs and then provides a comprehensive analysis of the entire architecture. We validate this methodology on a sample heterogenous low-power RISC-V architecture through RTL simulation and FPGA implementation, minimizing pessimism in bounding the service time of transactions crossing the architecture between 28% and 1%, which is considerably lower when compared to similar SoA works.

对于汽车、机器人和工业自动化等许多应用领域来说，确保现代实时片上系统（SoC）的可预测性越来越重要。一种有效的方法是对硬件组件（如互连和共享内存资源）进行建模和开发，以评估或强制执行其确定性行为。遗憾的是，这些 IP 通常是闭源的，而且这些研究仅限于单个模块，而这些模块随后必须与第三方 IP 集成到更复杂的 SoC 中，这就阻碍了建模的精度和范围，并影响了整体的可预测性。随着开源指令集架构（RISC-V）和硬件时代的到来，改变这一现状的重大机遇正在出现。本研究介绍了一种创新方法，用于对低功耗网络物理系统的最新（SoA）开源 SoC 进行建模和分析。我们的方法对这些 SoC 中的整套开源 IP 进行建模和分析，然后对整个架构进行全面分析。我们通过 RTL 仿真和 FPGA 实现，在一个样本异构低功耗 RISC-V 架构上验证了这一方法，最大限度地减少了悲观情绪，将跨越该架构的事务的服务时间限制在 28% 和 1% 之间，与类似的 SoA 作品相比大大降低。

{"title":"TOP: Towards Open & Predictable Heterogeneous SoCs","authors":"Luca Valente;Francesco Restuccia;Davide Rossi;Ryan Kastner;Luca Benini","doi":"10.1109/TC.2024.3441849","DOIUrl":"10.1109/TC.2024.3441849","url":null,"abstract":"Ensuring predictability in modern real-time Systems-on-Chip (SoCs) is an increasingly critical concern for many application domains such as automotive, robotics, and industrial automation. An effective approach involves the modeling and development of hardware components, such as interconnects and shared memory resources, to evaluate or enforce their deterministic behavior. Unfortunately, these IPs are often closed-source, and these studies are limited to the single modules that must later be integrated with third-party IPs in more complex SoCs, hindering the precision and scope of modeling and compromising the overall predictability. With the coming-of-age of open-source instruction set architectures (RISC-V) and hardware, major opportunities for changing this status quo are emerging. This study introduces an innovative methodology for modeling and analyzing State-of-the-Art (SoA) open-source SoCs for low-power cyber-physical systems. Our approach models and analyzes the entire set of open-source IPs within these SoCs and then provides a comprehensive analysis of the entire architecture. We validate this methodology on a sample heterogenous low-power RISC-V architecture through RTL simulation and FPGA implementation, minimizing pessimism in bounding the service time of transactions crossing the architecture between 28% and 1%, which is considerably lower when compared to similar SoA works.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2678-2692"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Learning-Augmented Scheduling 学习增强型调度

IF 3.6 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-12 DOI: 10.1109/TC.2024.3441856

Tianming Zhao;Wei Li;Albert Y. Zomaya

The recent revival in learning theory has provided us with improved capabilities for accurate predictions. This work contributes to an emerging research agenda of online scheduling with predictions by studying makespan minimization in uniformly related machine non-clairvoyant scheduling with job size predictions. Our task is to design online algorithms that use predictions and have performance guarantees tied to prediction quality. We first propose a simple algorithm-independent prediction error metric to quantify prediction quality. Then we design an offline improved 2-relaxed decision procedure approximating the optimal schedule to effectively use the predictions. With the decision procedure, we propose an online

$O(min{logeta,log m})$

-competitive static scheduling algorithm assuming a known prediction error. We use this algorithm to construct a robust

$O(min{logeta,log m})$

-competitive static scheduling algorithm that does not assume a known error. Finally, we extend these static scheduling algorithms to address dynamic scheduling where jobs arrive over time. The dynamic scheduling algorithms attain the same competitive ratios as the static ones. The presented algorithms require just moderate predictions to break the

$Omega(log m)$

competitive ratio lower bound, showing the potential of predictions in managing uncertainty.

最近学习理论的复兴为我们提供了更高的准确预测能力。本研究通过研究具有作业大小预测功能的均匀相关机器非千里眼调度中的时间跨度最小化，为具有预测功能的在线调度这一新兴研究议程做出了贡献。我们的任务是设计使用预测的在线算法，并将性能保证与预测质量挂钩。我们首先提出了一个与算法无关的简单预测误差度量来量化预测质量。然后，我们设计了一个离线改进的 2-relaxed 决策程序，该程序近似于最佳时间表，可有效利用预测结果。利用该决策程序，我们提出了一个在线 $O(min{logeta,log m})$ 竞争性静态调度算法，假设预测误差已知。我们使用该算法构建了一个鲁棒的$O(min{log/eta,log m})$竞争性静态调度算法，该算法不假设已知误差。最后，我们扩展了这些静态调度算法，以解决工作随时间到达的动态调度问题。动态调度算法达到了与静态算法相同的竞争比率。所提出的算法只需要适度的预测就能打破$Omega(log m)$竞争比下限，这显示了预测在管理不确定性方面的潜力。

{"title":"Learning-Augmented Scheduling","authors":"Tianming Zhao;Wei Li;Albert Y. Zomaya","doi":"10.1109/TC.2024.3441856","DOIUrl":"10.1109/TC.2024.3441856","url":null,"abstract":"The recent revival in learning theory has provided us with improved capabilities for accurate predictions. This work contributes to an emerging research agenda of online scheduling with predictions by studying makespan minimization in uniformly related machine non-clairvoyant scheduling with job size predictions. Our task is to design online algorithms that use predictions and have performance guarantees tied to prediction quality. We first propose a simple algorithm-independent prediction error metric to quantify prediction quality. Then we design an offline improved 2-relaxed decision procedure approximating the optimal schedule to effectively use the predictions. With the decision procedure, we propose an online \u0000<inline-formula><tex-math>$O(min{logeta,log m})$</tex-math></inline-formula>\u0000-competitive static scheduling algorithm assuming a known prediction error. We use this algorithm to construct a robust \u0000<inline-formula><tex-math>$O(min{logeta,log m})$</tex-math></inline-formula>\u0000-competitive static scheduling algorithm that does not assume a known error. Finally, we extend these static scheduling algorithms to address dynamic scheduling where jobs arrive over time. The dynamic scheduling algorithms attain the same competitive ratios as the static ones. The presented algorithms require just moderate predictions to break the \u0000<inline-formula><tex-math>$Omega(log m)$</tex-math></inline-formula>\u0000 competitive ratio lower bound, showing the potential of predictions in managing uncertainty.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2548-2562"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning PruneAug：利用自动分层块剪枝技术在多种稀疏平台上弥合 DNN 修剪和推理延迟问题

IF 3.6 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-12 DOI: 10.1109/TC.2024.3441855

Hanfei Geng;Yifei Liu;Yujie Zheng;Li Lyna Zhang;Jingwei Sun;Yujing Wang;Yang Wang;Guangzhong Sun;Mao Yang;Ting Cao;Yunxin Liu

Although pruning is an effective technique to reduce the number of weights in deep neural networks (DNNs), it remains challenging for the resulting sparse networks to perform low-latency inference on everyday hardware. This problem is mainly caused by the incompatibility between the unstructured sparsity adopted for accuracy preservation and the sparse platform's (the combination of sparse kernel library and the underlying hardware) expectation of regular sparse patterns. In order to resolve this conflict, we propose PruneAug, an augmentation over existing unstructured pruning methods that finds block-sparse networks with much lower latency but preserves the accuracy. The fundamental idea of PruneAug is to prune the network with a layerwise block dimension assignment in a platform-aware fashion. Subject to an accuracy-loss constraint, PruneAug minimizes the latency of the block sparse network by jointly optimizing this layerwise block dimension assignment and the network's sparsity level. Admittedly, this approach expands the solution space. To curb our search cost, we include multiple optimizations while designing PruneAug's search space and strategy. Our evaluation over diverse pruning methods, DNNs, datasets, and sparse platforms shows that PruneAug enables different pruning methods to achieve speedup (as much as

$boldsymbol{sim}13boldsymbol{times}$

depending on the platform) while maintaining competitive accuracy relative to unstructured sparsity, extracting the full potential of sparse platforms.

尽管剪枝是减少深度神经网络（DNN）中权重数量的有效技术，但由此产生的稀疏网络要在日常硬件上执行低延迟推理仍具有挑战性。造成这一问题的主要原因是，为保持准确性而采用的非结构稀疏性与稀疏平台（稀疏内核库与底层硬件的组合）对规则稀疏模式的期望之间不兼容。为了解决这一矛盾，我们提出了 PruneAug，它是对现有非结构化剪枝方法的一种增强，能以更低的延迟找到块稀疏网络，同时保持准确性。PruneAug 的基本思想是以平台感知的方式，通过按层分配块维度来剪枝网络。在精度损失约束条件下，PruneAug 通过联合优化层向块维度分配和网络的稀疏程度，最大限度地降低了块稀疏网络的延迟。诚然，这种方法扩大了求解空间。为了降低搜索成本，我们在设计 PruneAug 的搜索空间和策略时进行了多重优化。我们对不同的剪枝方法、DNN、数据集和稀疏平台进行的评估表明，PruneAug 可以让不同的剪枝方法实现提速（根据平台的不同，可提速多达 $boldsymbol{sim}13boldsymbol{times}$ ），同时保持相对于非结构化稀疏性的具有竞争力的准确性，充分挖掘稀疏平台的潜力。

{"title":"PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning","authors":"Hanfei Geng;Yifei Liu;Yujie Zheng;Li Lyna Zhang;Jingwei Sun;Yujing Wang;Yang Wang;Guangzhong Sun;Mao Yang;Ting Cao;Yunxin Liu","doi":"10.1109/TC.2024.3441855","DOIUrl":"10.1109/TC.2024.3441855","url":null,"abstract":"Although pruning is an effective technique to reduce the number of weights in deep neural networks (DNNs), it remains challenging for the resulting sparse networks to perform low-latency inference on everyday hardware. This problem is mainly caused by the incompatibility between the unstructured sparsity adopted for accuracy preservation and the sparse platform's (the combination of sparse kernel library and the underlying hardware) expectation of regular sparse patterns. In order to resolve this conflict, we propose PruneAug, an augmentation over existing unstructured pruning methods that finds block-sparse networks with much lower latency but preserves the accuracy. The fundamental idea of PruneAug is to prune the network with a layerwise block dimension assignment in a platform-aware fashion. Subject to an accuracy-loss constraint, PruneAug minimizes the latency of the block sparse network by jointly optimizing this layerwise block dimension assignment and the network's sparsity level. Admittedly, this approach expands the solution space. To curb our search cost, we include multiple optimizations while designing PruneAug's search space and strategy. Our evaluation over diverse pruning methods, DNNs, datasets, and sparse platforms shows that PruneAug enables different pruning methods to achieve speedup (as much as \u0000<inline-formula><tex-math>$boldsymbol{sim}13boldsymbol{times}$</tex-math></inline-formula>\u0000 depending on the platform) while maintaining competitive accuracy relative to unstructured sparsity, extracting the full potential of sparse platforms.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2576-2589"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Memristor-Based Approximate Query Architecture for In-Memory Hyperdimensional Computing 基于 Memristor 的内存超维计算近似查询架构

IF 3.6 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-12 DOI: 10.1109/TC.2024.3441861

Tianyang Yu;Bi Wu;Ke Chen;Gong Zhang;Weiqiang Liu

As a new computing paradigm, hyperdimensional computing (HDC) has gradually manifested its advantages in edge-side intelligent applications by virtue of its interpretability, hardware-friendliness and robustness. The core of HDC is to encode input samples into a hypervector, and then use it to query the class hypervector space. Compared with the conventional architecture that uses CMOS-based circuits to complete the computation in the query operation, the hyperdimensional associative memory (HAM) enables the query operation to be completed in memory, which significantly reduces the query delay and energy consumption. However, the existing HDC algorithms require the HAM to achieve high precision query in inference, which leads to the complex structure of the HAM, and thus makes the area and energy consumption of the HAM unable to be further reduced. In this paper, a novel efficient HAM architecture based on approximate query method is proposed, to simplify the existing architecture. Meanwhile, a training method of HDC which matches the proposed HAM architecture is proposed to compensate for the decrease in accuracy caused by approximate query. Experimental results show that the proposed HAM framework can save more than 60% of area and energy consumption, and achieve accuracy comparable to existing state-of-the-art methods by using the proposed training method.

作为一种新的计算范式，超维计算（HDC）凭借其可解释性、硬件友好性和鲁棒性，在边缘智能应用中逐渐显现出其优势。超维计算的核心是将输入样本编码成超向量，然后利用超向量查询类超向量空间。与使用基于 CMOS 电路完成查询运算的传统架构相比，超维度关联存储器（HAM）可在内存中完成查询运算，从而大大减少查询延迟和能耗。然而，现有的 HDC 算法要求 HAM 在推理中实现高精度查询，这导致 HAM 结构复杂，从而使 HAM 的面积和能耗无法进一步降低。本文提出了一种基于近似查询方法的新型高效 HAM 架构，以简化现有架构。同时，还提出了一种与所提 HAM 架构相匹配的 HDC 训练方法，以弥补近似查询导致的精度下降。实验结果表明，通过使用所提出的训练方法，所提出的 HAM 框架可以节省 60% 以上的面积和能耗，并达到与现有先进方法相当的精度。

{"title":"Memristor-Based Approximate Query Architecture for In-Memory Hyperdimensional Computing","authors":"Tianyang Yu;Bi Wu;Ke Chen;Gong Zhang;Weiqiang Liu","doi":"10.1109/TC.2024.3441861","DOIUrl":"10.1109/TC.2024.3441861","url":null,"abstract":"As a new computing paradigm, hyperdimensional computing (HDC) has gradually manifested its advantages in edge-side intelligent applications by virtue of its interpretability, hardware-friendliness and robustness. The core of HDC is to encode input samples into a hypervector, and then use it to query the class hypervector space. Compared with the conventional architecture that uses CMOS-based circuits to complete the computation in the query operation, the hyperdimensional associative memory (HAM) enables the query operation to be completed in memory, which significantly reduces the query delay and energy consumption. However, the existing HDC algorithms require the HAM to achieve high precision query in inference, which leads to the complex structure of the HAM, and thus makes the area and energy consumption of the HAM unable to be further reduced. In this paper, a novel efficient HAM architecture based on approximate query method is proposed, to simplify the existing architecture. Meanwhile, a training method of HDC which matches the proposed HAM architecture is proposed to compensate for the decrease in accuracy caused by approximate query. Experimental results show that the proposed HAM framework can save more than 60% of area and energy consumption, and achieve accuracy comparable to existing state-of-the-art methods by using the proposed training method.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2605-2618"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

AMBEA: Aggressive Maximal Biclique Enumeration in Large Bipartite Graph Computing AMBEA：大型双向图计算中的进取最大双向枚举

IF 3.6 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-12 DOI: 10.1109/TC.2024.3441864

Zhe Pan;Xu Li;Shuibing He;Xuechen Zhang;Rui Wang;Yunjun Gao;Gang Chen;Xian-He Sun

Maximal biclique enumeration (MBE) in bipartite graphs is a fundamental problem in data mining with widespread applications. Many recent works solve this problem based on the set-enumeration (SE) tree, which sequentially traverses vertices to generate the enumeration tree nodes representing distinct bicliques, then checks whether these bicliques are maximal or not. However, existing MBE algorithms only expand bicliques with untraversed vertices to ensure distinction, which often necessitate extensive node checks to eliminate non-maximal bicliques, resulting in significant computational overhead during the enumeration process. To address this issue, we propose an aggressive set-enumeration (ASE) tree that aggressively expands all bicliques to their maximal form, thus avoiding costly node checks on non-maximal bicliques. This aggressive enumeration may produce multiple duplicate maximal bicliques, but we efficiently eliminate these duplicates by leveraging the connection between parent and child nodes and conducting low-cost node checking. Additionally, we introduce an aggressive merge-based pruning (AMP) approach that aggressively merges vertices sharing the same local neighbors. This helps prune numerous duplicate node generations caused by subsets of merged vertices. We integrate the AMP approach into the ASE tree, and present the Aggressive Maximal Biclique Enumeration Algorithm (AMBEA). Experimental results show that AMBEA is 1.15

$times$

to 5.32

$times$

faster than its closest competitor and exhibits better scalability and parallelization capabilities on larger bipartite graphs.

双叉图中的最大双叉枚举（MBE）是数据挖掘中的一个基本问题，应用广泛。最近的许多研究都是基于集合枚举（SE）树来解决这个问题的，SE 树会依次遍历顶点，生成代表不同二叉的枚举树节点，然后检查这些二叉是否最大。然而，现有的 MBE 算法只扩展具有未遍历顶点的二叉以确保区分，这往往需要进行大量节点检查以消除非最大二叉，从而在枚举过程中产生大量计算开销。为了解决这个问题，我们提出了一种积极的集合枚举（ASE）树，它能积极地将所有二叉树扩展为最大形式，从而避免了对非最大二叉树进行代价高昂的节点检查。这种积极的枚举可能会产生多个重复的最大二叉，但我们利用父节点和子节点之间的连接，并进行低成本的节点检查，从而有效地消除了这些重复。此外，我们还引入了一种基于合并的积极剪枝（AMP）方法，该方法会积极合并共享相同本地邻居的顶点。这有助于剪除由合并顶点子集引起的大量重复节点生成。我们将 AMP 方法集成到 ASE 树中，并提出了进取最大双斜枚举算法（AMBEA）。实验结果表明，AMBEA 比其最接近的竞争对手快 1.15 到 5.32 倍，并且在更大的双叉图上表现出更好的可扩展性和并行化能力。

{"title":"AMBEA: Aggressive Maximal Biclique Enumeration in Large Bipartite Graph Computing","authors":"Zhe Pan;Xu Li;Shuibing He;Xuechen Zhang;Rui Wang;Yunjun Gao;Gang Chen;Xian-He Sun","doi":"10.1109/TC.2024.3441864","DOIUrl":"10.1109/TC.2024.3441864","url":null,"abstract":"Maximal biclique enumeration (MBE) in bipartite graphs is a fundamental problem in data mining with widespread applications. Many recent works solve this problem based on the set-enumeration (SE) tree, which sequentially traverses vertices to generate the enumeration tree nodes representing distinct bicliques, then checks whether these bicliques are maximal or not. However, existing MBE algorithms only expand bicliques with untraversed vertices to ensure distinction, which often necessitate extensive node checks to eliminate non-maximal bicliques, resulting in significant computational overhead during the enumeration process. To address this issue, we propose an aggressive set-enumeration (ASE) tree that aggressively expands all bicliques to their maximal form, thus avoiding costly node checks on non-maximal bicliques. This aggressive enumeration may produce multiple duplicate maximal bicliques, but we efficiently eliminate these duplicates by leveraging the connection between parent and child nodes and conducting low-cost node checking. Additionally, we introduce an aggressive merge-based pruning (AMP) approach that aggressively merges vertices sharing the same local neighbors. This helps prune numerous duplicate node generations caused by subsets of merged vertices. We integrate the AMP approach into the ASE tree, and present the Aggressive Maximal Biclique Enumeration Algorithm (AMBEA). Experimental results show that AMBEA is 1.15\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 to 5.32\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 faster than its closest competitor and exhibits better scalability and parallelization capabilities on larger bipartite graphs.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2664-2677"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Land of Oz: Resolving Orderless Writes in Zoned Namespace SSDs 绿野仙踪：解决分区命名空间固态硬盘中的无序写入问题

IF 3.6 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-12 DOI: 10.1109/TC.2024.3441866

Yingjia Wang;You Zhou;Fei Wu;Jie Zhang;Ming-Chang Yang

Zoned Namespace (ZNS) SSDs present a new class of storage devices that promise low cost, stable performance, and software-defined capability. ZNS abstracts the SSD into an array of zones that can only be written sequentially. Although ZNS-compatible filesystems (e.g., F2FS) launch sequential writes, the zone write constraint may be violated due to orderless Linux I/O stack. Hence, the write queue depth of each zone must be limited to one, which severely degrades the performance of small writes.

In this paper, we propose oZNS SSD (o: orderless), which allows multiple writes to be submitted to a zone and processed out-of-order in the SSD. To bridge the gap between ZNS sequential write contract and orderless writes on flash, a lightweight indirection layer is introduced and carefully designed by exploiting the characteristics of out-of-order writes. Specifically, memory-efficient metadata structures are devised to record the write deviations of data pages, and a two-tier buffering mechanism is employed to reduce metadata and accelerate metadata access. Moreover, a read-try policy removes metadata access from read critical path and thus improves the data read efficiency. Experimental results show that oZNS can achieve up to 8.6$times$× higher write performance than traditional ZNS while providing comparable read performance.

分区命名空间（ZNS）固态硬盘是一种新型存储设备，具有成本低、性能稳定和软件定义功能等优点。ZNS 将固态硬盘抽象为只能按顺序写入的区域阵列。虽然与 ZNS 兼容的文件系统（如 F2FS）可以进行顺序写入，但由于 Linux I/O 堆栈的无序性，可能会违反区域写入约束。因此，每个区的写入队列深度必须限制为 1，这严重降低了小规模写入的性能。在本文中，我们提出了 oZNS SSD（o：无序），它允许向一个区域提交多个写入，并在 SSD 中进行无序处理。为了弥合 ZNS 连续写入合约与闪存上无序写入之间的差距，我们引入了一个轻量级间接层，并利用无序写入的特点进行了精心设计。具体来说，设计了内存效率高的元数据结构来记录数据页的写入偏差，并采用双层缓冲机制来减少元数据并加速元数据访问。此外，读取尝试策略将元数据访问从读取关键路径中移除，从而提高了数据读取效率。实验结果表明，oZNS 的写性能比传统 ZNS 高出 8.6 倍，同时读性能相当。

{"title":"Land of Oz: Resolving Orderless Writes in Zoned Namespace SSDs","authors":"Yingjia Wang;You Zhou;Fei Wu;Jie Zhang;Ming-Chang Yang","doi":"10.1109/TC.2024.3441866","DOIUrl":"10.1109/TC.2024.3441866","url":null,"abstract":"Zoned Namespace (ZNS) SSDs present a new class of storage devices that promise low cost, stable performance, and software-defined capability. ZNS abstracts the SSD into an array of zones that can only be written sequentially. Although ZNS-compatible filesystems (e.g., F2FS) launch sequential writes, the zone write constraint may be violated due to orderless Linux I/O stack. Hence, the write queue depth of each zone must be limited to one, which severely degrades the performance of small writes. \u0000<p>In this paper, we propose <i>oZNS SSD</i> (o: orderless), which allows multiple writes to be submitted to a zone and processed out-of-order in the SSD. To bridge the gap between ZNS sequential write contract and orderless writes on flash, a lightweight indirection layer is introduced and carefully designed by exploiting the characteristics of out-of-order writes. Specifically, memory-efficient metadata structures are devised to record the write deviations of data pages, and a two-tier buffering mechanism is employed to reduce metadata and accelerate metadata access. Moreover, a read-try policy removes metadata access from read critical path and thus improves the data read efficiency. Experimental results show that oZNS can achieve up to 8.6<inline-formula><tex-math>$times$</tex-math><alternatives><mml:math><mml:mo>×</mml:mo></mml:math><inline-graphic></alternatives></inline-formula> higher write performance than traditional ZNS while providing comparable read performance.</p>","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2520-2533"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Parallel Tag Cache for Hardware Managed Tagged Memory in Multicore Processors 多核处理器中硬件管理标签内存的并行标签缓存

IF 3.6 2区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers

Pub Date : 2024-08-12 DOI: 10.1109/TC.2024.3441835

Wei Song;Da Xie;Zihan Xue;Peng Liu

Hardware-managed tagged memory is the dominant way of supporting tags in current processor designs. Most of these processors reserve a hidden tag partition in the memory dedicated for tags and use a small tag cache (TC) to reduce the extra memory accesses introduced by the tag partition. Recent research shows that storing tags in a hierarchical tag table (HTT) inside the tag partition allows efficient compression in a TC, but the use of the HTT causes special data inconsistency issues when multiple related tag accesses are served simultaneously. How to design a parallel TC for multicore processors remains an open problem. We proposed the first TC capable of serving multiple tag accesses in parallel. It adopts a two-phase locking procedure to maintain data consistency and integrates seven techniques, where three are firstly proposed, and two are theoretical concepts materialized into usable solutions for the first time. Single-core and multicore performance results show that the proposed TC is effective in reducing both the extra amount of memory accesses to the tag partition and the overhead in execution time. It is important to provide enough number of trackers in multicore processors while providing extra trackers is beneficial for running HTT/TC ineffective applications.

在目前的处理器设计中，硬件管理标签内存是支持标签的主要方式。这些处理器大多在内存中为标签保留一个隐藏的标签分区，并使用一个小型标签缓存（TC）来减少标签分区带来的额外内存访问。最近的研究表明，在标签分区内的分层标签表（HTT）中存储标签可以在 TC 中实现高效压缩，但当同时进行多个相关标签访问时，使用 HTT 会导致特殊的数据不一致问题。如何为多核处理器设计并行 TC 仍是一个未决问题。我们提出了第一个能够并行提供多个标签访问服务的 TC。它采用两阶段锁定程序来保持数据一致性，并集成了七项技术，其中三项是首次提出，两项是首次将理论概念具体化为可用解决方案。单核和多核性能结果表明，所提出的 TC 能有效减少标签分区的额外内存访问量和执行时间开销。在多核处理器中提供足够数量的跟踪器非常重要，而提供额外的跟踪器则有利于运行 HTT/TC 无效应用程序。

{"title":"A Parallel Tag Cache for Hardware Managed Tagged Memory in Multicore Processors","authors":"Wei Song;Da Xie;Zihan Xue;Peng Liu","doi":"10.1109/TC.2024.3441835","DOIUrl":"10.1109/TC.2024.3441835","url":null,"abstract":"Hardware-managed tagged memory is the dominant way of supporting tags in current processor designs. Most of these processors reserve a hidden tag partition in the memory dedicated for tags and use a small tag cache (TC) to reduce the extra memory accesses introduced by the tag partition. Recent research shows that storing tags in a hierarchical tag table (HTT) inside the tag partition allows efficient compression in a TC, but the use of the HTT causes special data inconsistency issues when multiple related tag accesses are served simultaneously. How to design a parallel TC for multicore processors remains an open problem. We proposed the first TC capable of serving multiple tag accesses in parallel. It adopts a two-phase locking procedure to maintain data consistency and integrates seven techniques, where three are firstly proposed, and two are theoretical concepts materialized into usable solutions for the first time. Single-core and multicore performance results show that the proposed TC is effective in reducing both the extra amount of memory accesses to the tag partition and the overhead in execution time. It is important to provide enough number of trackers in multicore processors while providing extra trackers is beneficial for running HTT/TC ineffective applications.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2488-2503"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0