
Latest articles from IEEE Transactions on Computers

Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-16 · DOI: 10.1109/TC.2024.3389507
Hulin Wang;Donglin Yang;Yaqi Xia;Zheng Zhang;Qigang Wang;Jianping Fan;Xiaobo Zhou;Dazhao Cheng
Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the $O(n^{2})$ complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.
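Raptor-T's fused CUDA kernels are not reproduced here, but the algorithmic idea behind sparse attention can be illustrated with one common sparsity pattern. The sketch below (a minimal numpy illustration, not the authors' implementation) contrasts full attention with sliding-window attention, where each query attends only to keys within ±w positions, reducing the quadratic dependency to linear in the sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(Q, K, V):
    # Full attention: every query attends to every key, O(n^2) scores.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def sliding_window_attention(Q, K, V, w):
    # Sparse attention: each query attends only to keys within +/- w
    # positions, so work and memory drop from O(n^2) to O(n * w).
    n, d = Q.shape
    out = np.empty((n, V.shape[-1]))
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        s = Q[i] @ K[lo:hi].T / np.sqrt(d)
        out[i] = softmax(s) @ V[lo:hi]
    return out
```

When w ≥ n−1 the window covers the whole sequence and the two functions agree; for long sequences with a fixed w, the sparse variant touches only O(n·w) score entries.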
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1852-1865.
Citations: 0
Ara2: Exploring Single- and Multi-Core Vector Processing With an Efficient RVV 1.0 Compliant Open-Source Processor
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-15 · DOI: 10.1109/TC.2024.3388896
Matteo Perotti;Matheus Cavalcante;Renzo Andri;Lukas Cavigelli;Luca Benini
Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main vector architecture's performance drivers. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: $\sim$40 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x improved energy efficiency.
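A key RVV 1.0 feature that processors like Ara2 implement is strip-mining via `vsetvli`: the loop asks the hardware how many elements the next vector instruction may process, so any problem size is handled without a scalar remainder loop. The following Python sketch (an illustrative model, not Ara2's microarchitecture) mimics that control flow for an AXPY kernel; `vlmax` stands in for the hardware's VLMAX.

```python
def vsetvl(avl, vlmax):
    # Models RVV 1.0 vsetvl: given the application vector length (avl),
    # return how many elements the next vector instruction processes.
    return min(avl, vlmax)

def axpy_strip_mined(a, x, y, vlmax=4):
    # y[i] += a * x[i], processed in hardware-sized strips, as an RVV
    # loop would with a vsetvli at the top of each iteration.
    n, i = len(x), 0
    while i < n:
        vl = vsetvl(n - i, vlmax)
        for j in range(i, i + vl):   # one vector instruction in hardware
            y[j] += a * x[j]
        i += vl
    return y
```

For example, `axpy_strip_mined(2.0, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])` returns `[2.0, 4.0, 6.0]`; with `vlmax=4`, a length-7 input is processed as strips of 4 and 3.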
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1822-1836.
Citations: 0
Construction of Reed-Solomon Erasure Codes With Four Parities Based on Systematic Vandermonde Matrices
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-10 · DOI: 10.1109/TC.2024.3387069
Leilei Yu;Yunghsiang S. Han
In 2021, Tang et al. proposed an improved construction of Reed-Solomon (RS) erasure codes with four parity symbols to accelerate the computation of the Reed-Muller (RM) transform-based RS algorithm. The idea is to change the original Vandermonde parity-check matrix into a systematic Vandermonde parity-check matrix. However, the construction relies on a computer search and requires that the size of the information vector of RS codes does not exceed 52. This paper improves on that idea and proposes a purely algebraic construction. The proposed method has a more explicit construction, a wider range of codeword lengths, and competitive encoding/erasure decoding computational complexity.
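The notion of a systematic Vandermonde parity-check matrix can be made concrete with a small sketch (an illustration of the general idea, not the paper's specific construction): build a 4-row Vandermonde matrix over GF(2^8) at distinct evaluation points, then row-reduce it so the last four columns form the identity. Row operations preserve the code, so the result $[A \mid I]$ is an equivalent parity-check matrix from which the four parities are computed directly as $p = A m$. The field polynomial 0x11d and the points $x_j = 2^j$ are illustrative choices.

```python
GF_POLY = 0x11d  # a common primitive polynomial for GF(2^8) (illustrative choice)

def gf_mul(a, b):
    # Carry-less multiplication reduced modulo the field polynomial.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
        b >>= 1
    return r

def gf_inv(a):
    # a^(2^8 - 2) = a^(-1) for nonzero a in GF(2^8).
    r, e = 1, 254
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def vandermonde_parity_check(n, t=4):
    # t x n Vandermonde matrix H[i][j] = x_j^i at distinct points x_j = 2^j.
    xs, x = [], 1
    for _ in range(n):
        xs.append(x)
        x = gf_mul(x, 2)
    H = [[1] * n]
    for _ in range(t - 1):
        H.append([gf_mul(H[-1][j], xs[j]) for j in range(n)])
    return H

def systematic(H):
    # Row-reduce so the last t columns become the identity. Row operations
    # keep the same nullspace, i.e., the same code: H becomes [A | I].
    t, n = len(H), len(H[0])
    H = [row[:] for row in H]
    for i in range(t):
        c = n - t + i
        piv = next(r for r in range(i, t) if H[r][c])
        H[i], H[piv] = H[piv], H[i]
        inv = gf_inv(H[i][c])
        H[i] = [gf_mul(inv, v) for v in H[i]]
        for r in range(t):
            if r != i and H[r][c]:
                f = H[r][c]
                H[r] = [H[r][j] ^ gf_mul(f, H[i][j]) for j in range(n)]
    return H

def encode(Hs, message):
    # With Hs = [A | I], the parities are simply p = A * m (addition is XOR),
    # and the systematic codeword is m followed by p.
    t, n = len(Hs), len(Hs[0])
    k = n - t
    assert len(message) == k
    parities = []
    for i in range(t):
        acc = 0
        for j in range(k):
            acc ^= gf_mul(Hs[i][j], message[j])
        parities.append(acc)
    return message + parities
```

Any four columns of a Vandermonde matrix at distinct points are invertible, which is why the row reduction always finds a pivot and why any four erasures remain recoverable.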
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1875-1882.
Citations: 0
Multi-Objective Hardware-Mapping Co-Optimisation for Multi-DNN Workloads on Chiplet-Based Accelerators
IF 3.6 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-10 · DOI: 10.1109/TC.2024.3386067
Abhijit Das;Enrico Russo;Maurizio Palesi
The need to efficiently execute different Deep Neural Networks (DNNs) on the same computing platform, coupled with the requirement for easy scalability, makes Multi-Chip Module (MCM)-based accelerators a preferred design choice. Such an accelerator brings together heterogeneous sub-accelerators in the form of chiplets, interconnected by a Network-on-Package (NoP). This paper addresses the challenge of selecting the most suitable sub-accelerators, configuring them, determining their optimal placement in the NoP, and mapping the layers of a predetermined set of DNNs spatially and temporally. The objective is to minimise execution time and energy consumption during parallel execution while also minimising the overall cost, specifically the silicon area, of the accelerator. This paper presents MOHaM, a framework for multi-objective hardware-mapping co-optimisation for multi-DNN workloads on chiplet-based accelerators. MOHaM exploits a multi-objective evolutionary algorithm that has been specialised for the given problem by incorporating several customised genetic operators. MOHaM is evaluated against state-of-the-art Design Space Exploration (DSE) frameworks on different multi-DNN workload scenarios. The solutions discovered by MOHaM are Pareto optimal compared to those by the state-of-the-art. Specifically, MOHaM-generated accelerator designs can reduce latency by up to 96% and energy by up to 96.12%.
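The core operation behind any multi-objective search like MOHaM's is Pareto filtering: a design point survives only if no other point is at least as good in every objective and strictly better in one. A minimal sketch (generic, not MOHaM's evolutionary algorithm) over hypothetical (latency, energy, area) tuples:

```python
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly
    # better in at least one (all objectives are minimised here).
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    # Keep only non-dominated points; these are the Pareto-optimal designs.
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, with designs `[(10, 5, 3), (8, 6, 3), (12, 4, 2), (9, 9, 9)]`, the point `(9, 9, 9)` is dominated by `(8, 6, 3)` and is dropped, while the other three are mutually non-dominated trade-offs.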
IEEE Transactions on Computers, vol. 73, no. 8, pp. 1883-1898.
Citations: 0
COALA: A Compiler-Assisted Adaptive Library Routines Allocation Framework for Heterogeneous Systems
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-09 · DOI: 10.1109/TC.2024.3385269
Qinyun Cai;Guanghua Tan;Wangdong Yang;Xianhao He;Yuwei Yan;Keqin Li;Kenli Li
Experienced developers often leverage well-tuned libraries and allocate their routines for computing tasks to enhance performance when building modern scientific and engineering applications. However, such well-tuned libraries are meticulously customized for specific target architectures or environments. Additionally, the performance of their routines is significantly impacted by the actual input data of computing tasks, which often remains uncertain until runtime. Accordingly, statically allocating these library routines may hinder the adaptability of applications and compromise performance, particularly in the context of heterogeneous systems. To address this issue, we propose the Compiler-Assisted Adaptive Library Routines Allocation (COALA) framework for heterogeneous systems. COALA is a fully automated mechanism that employs compiler assistance for dynamic allocation of the most suitable routine to each computing task on heterogeneous systems. It allows the deployment of varying allocation policies tailored to specific optimization targets. During the application compilation process, COALA reconstructs computing tasks and inserts a probe for each of these tasks. Probes serve the purpose of conveying vital information about the requirements of each task, including its computing objective, data size, and computing flops, to a user-level allocation component at runtime. Subsequently, the allocation component utilizes the probe information along with the allocation policy to assign the most optimal library routine for executing the computing tasks. In our prototype, we further introduce and deploy a performance-oriented allocation policy founded on a machine learning-based performance evaluation method for library routines. 
Experimental verification and evaluation on two heterogeneous systems reveal that COALA can significantly improve application performance, with gains of up to 4.3x for numerical simulation software and 4.2x for machine learning applications, and enhance system utilization by up to 27.8%.
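The probe-then-allocate flow described above can be sketched in a few lines. In this hypothetical model (the probe fields follow the abstract's description, but the routine names and cost functions are invented for illustration), each compiled task carries a probe with its objective, data size, and flop count, and a performance-oriented policy picks the routine with the lowest predicted execution time:

```python
from dataclasses import dataclass

@dataclass
class Probe:
    # Illustrative probe contents, per the paper's description: the task's
    # computing objective, its data size, and its flop count.
    objective: str
    size: int
    flops: float

# Hypothetical per-routine cost models (predicted seconds), e.g. fitted
# offline from measurements; the constants here are made up.
ROUTINES = {
    "gemm": {
        "cpu_blas": lambda p: 1e-9 * p.flops,
        "gpu_blas": lambda p: 5e-3 + 1e-11 * p.flops,  # launch overhead + fast compute
    },
}

def allocate(probe):
    # Performance-oriented policy: choose the candidate routine with the
    # lowest predicted execution time for this probe.
    candidates = ROUTINES[probe.objective]
    return min(candidates, key=lambda name: candidates[name](probe))
```

Under these assumed cost models, a small GEMM stays on the CPU (the GPU launch overhead dominates) while a large GEMM is routed to the GPU, which is exactly the kind of input-dependent decision a static allocation cannot make.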
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1724-1737.
Citations: 0
Incendio: Priority-Based Scheduling for Alleviating Cold Start in Serverless Computing
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-08 · DOI: 10.1109/TC.2024.3386063
Xinquan Cai;Qianlong Sang;Chuang Hu;Yili Gong;Kun Suo;Xiaobo Zhou;Dazhao Cheng
In serverless computing, cold start results in long response latency. Existing approaches strive to alleviate the issue by reducing the number of cold starts. However, our measurement based on real-world production traces shows that the minimum number of cold starts does not equate to the minimum response latency, and solely focusing on optimizing the number of cold starts will lead to sub-optimal performance. The root cause is that functions have different priorities in terms of latency benefits by transferring a cold start to a warm start. In this paper, we propose Incendio, a serverless computing framework exploiting priority-based scheduling to minimize the overall response latency from the perspective of cloud providers. We reveal the priority of a function is correlated to multiple factors and design a priority model based on Spearman's rank correlation coefficient. We integrate a hybrid Prophet-LightGBM prediction model to dynamically manage runtime pools, which enables the system to prewarm containers in advance and terminate containers at the appropriate time. Furthermore, to satisfy the low-cost and high-accuracy requirements in serverless computing, we propose a Clustered Reinforcement Learning-based function scheduling strategy. The evaluations show that Incendio speeds up the native system by 1.4$\times$, and achieves 23% and 14.8% latency reductions compared to two state-of-the-art approaches.
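The priority model is built on Spearman's rank correlation coefficient, which measures how monotonically a candidate factor (e.g., invocation frequency) tracks the latency benefit of warming a function. A minimal self-contained implementation (with average ranks for ties; how Incendio combines the coefficients into a priority is not shown here):

```python
def ranks(xs):
    # 1-based ranks, with tied values assigned their average rank.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Because it works on ranks, any monotone relationship scores ±1: `spearman([1, 2, 3, 4], [1, 4, 9, 16])` is 1.0 even though the relationship is nonlinear, which suits factors whose effect on latency benefit is monotone but not linear.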
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1780-1794.
Citations: 0
Design, Implementation and Evaluation of a New Variable Latency Integer Division Scheme
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-08 · DOI: 10.1109/TC.2024.3386060
Marco Angioli;Marcello Barbirotta;Abdallah Cheikh;Antonio Mastrandrea;Francesco Menichelli;Saeid Jamili;Mauro Olivieri
Integer division is key for various applications and often represents the performance bottleneck due to its inherent mathematical properties that limit its parallelization. This paper presents a new data-dependent variable latency division algorithm derived from the classic non-performing restoring method. The proposed technique exploits the relationship between the number of leading zeros in the divisor and in the partial remainder to dynamically detect and skip those iterations that result in a simple left shift. While a similar principle has been exploited in previous works, the proposed approach outperforms existing variable latency divider schemes in average latency and power consumption. We detail the algorithm and its implementation in four variants, offering versatility for the specific application requirements. For each variant, we report the average latency evaluated with different benchmarks, and we analyze the synthesis results for both FPGA and ASIC deployment, reporting clock speed, average execution time, hardware resources, and energy consumption, compared with existing fixed and variable latency dividers.
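The leading-zeros shortcut can be demonstrated in software. The sketch below (an illustration of the general skip idea, not the authors' exact hardware datapath) runs classic restoring division MSB-first, but whenever the partial remainder's bit-length is below the divisor's, it shifts in several dividend bits at once: all those quotient bits are provably zero, since the intermediate remainders cannot reach the divisor.

```python
def restoring_div_with_skip(n, d, width=32):
    # Unsigned restoring division that skips runs of guaranteed-zero
    # quotient bits (resolved in one cycle by hardware skip logic).
    assert d > 0 and 0 <= n < (1 << width)
    q = r = 0
    i = width - 1          # index of the next dividend bit (MSB first)
    steps = 0              # iterations actually executed
    while i >= 0:
        # While bitlen(r) + k <= bitlen(d) - 1, shifting in k bits keeps
        # every intermediate remainder below d, so those quotient bits are 0.
        k = min(max(d.bit_length() - 1 - r.bit_length(), 0), i + 1)
        if k:
            r = (r << k) | ((n >> (i - k + 1)) & ((1 << k) - 1))
            q <<= k
            i -= k
            steps += 1
            if i < 0:
                break
        # One classic restoring iteration for the next bit.
        r = (r << 1) | ((n >> i) & 1)
        q <<= 1
        if r >= d:
            r -= d
            q |= 1
        i -= 1
        steps += 1
    return q, r, steps
```

The skip preserves correctness because every skipped iteration would have produced a zero quotient bit and performed no subtraction; only the iteration count changes, which is exactly why the latency becomes data-dependent.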
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1767-1779 (open access).
Citations: 0
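The shift-skipping idea behind the division scheme can be illustrated with a short behavioral model. This is a sketch only, not the paper's four hardware variants: it runs classic non-performing restoring division, but collapses each run of quotient bits that are guaranteed to be zero (detected from the bit-length, i.e. leading-zero, gap between the divisor and the partial remainder) into a single step. Counting a skip as one cycle is a modeling assumption.

```python
def clz_gap(divisor: int, remainder: int) -> int:
    """Upper bound on the number of guaranteed-zero quotient bits, derived
    from the leading-zero (bit-length) gap between divisor and remainder."""
    return max(0, divisor.bit_length() - (remainder + 1).bit_length())

def variable_latency_udiv(dividend: int, divisor: int, width: int = 32):
    """Unsigned restoring division that skips shift-only iterations.

    Returns (quotient, remainder, cycles), where cycles counts executed
    steps: each skip of k bits is modeled as a single cycle."""
    assert divisor != 0
    remainder, quotient, cycles = 0, 0, 0
    i = width - 1                       # index of the next dividend bit
    while i >= 0:
        # After shifting in k bits the remainder is < 2**k * (remainder+1),
        # so any k with 2**k * (remainder+1) <= divisor can only shift.
        k = clz_gap(divisor, remainder)
        while k > 0 and ((remainder + 1) << k) > divisor:
            k -= 1
        k = min(k, i + 1)
        if k > 0:                       # skip k guaranteed-zero quotient bits
            bits = (dividend >> (i - k + 1)) & ((1 << k) - 1)
            remainder = (remainder << k) | bits
            quotient <<= k
            i -= k
        else:                           # classic non-performing restoring step
            remainder = (remainder << 1) | ((dividend >> i) & 1)
            quotient <<= 1
            if remainder >= divisor:
                remainder -= divisor
                quotient |= 1
            i -= 1
        cycles += 1
    return quotient, remainder, cycles
```

For example, `variable_latency_udiv(7, 100)` finishes in far fewer than 32 cycles because almost every iteration is a guaranteed shift, which is exactly the data-dependent latency behavior the paper exploits.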
Error-Detection Schemes for Analog Content-Addressable Memories
IF 3.7 CAS Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-04-08 DOI: 10.1109/TC.2024.3386065
Ron M. Roth
Analog content-addressable memories (in short, a-CAMs) have been recently introduced as accelerators for machine-learning tasks, such as tree-based inference or implementation of nonlinear activation functions. The cells in these memories contain nanoscale memristive devices, which may be susceptible to various types of errors, such as manufacturing defects, inaccurate programming of the cells, or drifts in their contents over time. The objective of this work is to develop techniques for overcoming the reliability issues that are caused by such error events. To this end, several coding schemes are presented for the detection of errors in a-CAMs. These schemes consist of an encoding stage, a detection cycle (which is performed periodically), and some minor additions to the hardware. During encoding, redundancy symbols are programmed into a portion of the a-CAM (or, alternatively, are written into an external memory). During each detection cycle, a certain set of input vectors is applied to the a-CAM. The schemes differ in several ways, e.g., in the range of alphabet sizes that they are most suitable for, in the tradeoff that each provides between redundancy and hardware additions, or in the type of errors that they handle (Hamming metric versus $L_{1}$ metric).
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1795-1808.
Citations: 0
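The encode-then-periodic-detect structure described in the abstract can be illustrated with a toy checksum. This is emphatically not one of the paper's coding schemes (and a real a-CAM exposes cells only through match lines, not direct reads); it only shows the shape of the pipeline: an encoding stage programs a redundancy symbol, and a periodic detection cycle re-checks it. The level count `Q` is an assumption.

```python
Q = 8  # assumed number of distinguishable analog levels per cell

def encode_row(levels: list[int]) -> list[int]:
    """Encoding stage: append one redundancy symbol so the row sums to 0 mod Q."""
    return levels + [(-sum(levels)) % Q]

def detection_cycle(row: list[int]) -> bool:
    """Periodic check: the row passes iff its symbols still sum to 0 mod Q."""
    return sum(row) % Q == 0

row = encode_row([3, 1, 4, 1, 5])
drifted = list(row)
drifted[2] = (drifted[2] + 1) % Q  # a magnitude-1 content drift (L1 distance 1)
```

This toy check catches any single-cell error whose magnitude is not a multiple of Q, which loosely mirrors the abstract's distinction between Hamming-metric and L1-metric error models.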
Multi-Grained Trace Collection, Analysis, and Management of Diverse Container Images
IF 3.7 CAS Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-04-08 DOI: 10.1109/TC.2024.3383966
Zhuo Huang;Qi Zhang;Hao Fan;Song Wu;Chen Yu;Hai Jin;Jun Deng;Jing Gu;Zhimin Tang
Container technology is getting popular in cloud environments due to its lightweight feature and convenient deployment. The container registry plays a critical role in container-based clouds, as many container startups involve downloading layer-structured container images from a container registry. However, the container registry is struggling to efficiently manage images (i.e., transfer and store) with the emergence of diverse services and new image formats. The reason is that the container registry manages images uniformly at layer granularity. On the one hand, such uniform layer-level management probably cannot fit the various requirements of different kinds of containerized services well. On the other hand, new image formats organizing data in blocks or files cannot benefit from such uniform layer-level image management. In this paper, we perform the first analysis of image traces at multiple granularities (i.e., image-, layer-, and file-level) for various services and provide an in-depth comparison of different image formats. The traces are collected from a production-level container registry, amounting to 24 million requests and involving more than 184 TB of transferred data. We provide a number of valuable insights, including request patterns of services, file-level access patterns, and bottlenecks associated with different image formats. Based on these insights, we also propose two optimizations to improve image transfer and application deployment.
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1698-1710.
Citations: 0
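The multi-granularity analysis (image-, layer-, and file-level) can be sketched as a grouping over trace records. The record layout, field names, and values below are invented for illustration; they are not taken from the paper's production trace.

```python
from collections import Counter

# Hypothetical registry-trace records: (service, image, layer, file, bytes).
trace = [
    ("web", "nginx:1.25",  "layer-a", "/etc/nginx/nginx.conf",  4_096),
    ("web", "nginx:1.25",  "layer-a", "/usr/sbin/nginx",        1_048_576),
    ("web", "nginx:1.25",  "layer-b", "/lib/libssl.so.3",       2_097_152),
    ("ml",  "pytorch:2.1", "layer-c", "/opt/torch/libtorch.so", 512_000_000),
]

GRANULARITY_FIELD = {"image": 1, "layer": 2, "file": 3}

def requests_at(granularity: str) -> Counter:
    """Count registry requests grouped at the chosen granularity."""
    idx = GRANULARITY_FIELD[granularity]
    return Counter(rec[idx] for rec in trace)

def bytes_at(granularity: str) -> Counter:
    """Sum transferred bytes grouped at the chosen granularity."""
    idx = GRANULARITY_FIELD[granularity]
    totals: Counter = Counter()
    for rec in trace:
        totals[rec[idx]] += rec[4]
    return totals
```

Viewing the same trace at different granularities is what lets the paper contrast layer-structured images against the newer block- and file-organized formats.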
Analysis and Mitigation of Shared Resource Contention on Heterogeneous Multicore: An Industrial Case Study
IF 3.7 CAS Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-04-08 DOI: 10.1109/TC.2024.3386059
Michael Bechtel;Heechul Yun
In this paper, we present a solution to the industrial challenge put forth by ARM in 2022. We systematically analyze the effect of shared resource contention to an augmented reality head-up display (AR-HUD) case-study application of the industrial challenge on a heterogeneous multicore platform, NVIDIA Jetson Nano. We configure the AR-HUD application such that it can process incoming image frames in real-time at 20Hz on the platform. We use Microarchitectural Denial-of-Service (DoS) attacks as aggressor workloads of the challenge and show that they can dramatically impact the latency and accuracy of the AR-HUD application. This results in significant deviations of the estimated trajectories from known ground truths, despite our best effort to mitigate their influence by using cache partitioning and real-time scheduling of the AR-HUD application. To address the challenge, we propose RT-Gang++, a partitioned real-time gang scheduling framework with last-level cache (LLC) and integrated GPU bandwidth throttling capabilities. By applying RT-Gang++, we are able to achieve desired level of performance of the AR-HUD application even in the presence of fully loaded aggressor tasks.
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1753-1766.
Citations: 0
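The access pattern of a memory-bandwidth aggressor, the kind of microarchitectural DoS workload the abstract uses against the AR-HUD application, can be modeled in a few lines. Real aggressors are written in C or assembly; Python's interpreter overhead makes this illustrative only, and the sizes below are assumptions rather than the paper's configuration: touching one byte per cache line of a buffer much larger than the LLC makes nearly every access miss the cache and occupy shared DRAM bandwidth.

```python
BUF_SIZE = 64 * 1024 * 1024   # bytes; far larger than a Jetson-Nano-class LLC
CACHE_LINE = 64               # bytes per cache line (typical for ARM/x86)

def aggressor_round(buf: bytearray, value: int) -> int:
    """One sweep of a bandwidth aggressor: write one byte per cache line
    across the whole buffer; returns the number of lines touched.
    A real attack loops such sweeps forever on a victim's sibling core."""
    for i in range(0, len(buf), CACHE_LINE):
        buf[i] = value
    return len(buf) // CACHE_LINE

buf = bytearray(BUF_SIZE)
lines = aggressor_round(buf, 0xAA)
```

Mitigations like the paper's RT-Gang++ work precisely by bounding how much of the LLC and memory bandwidth such a sweep can consume, rather than by changing the aggressor itself.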