
Latest Publications in IEEE Transactions on Computers

Serving MoE Models on Resource-Constrained Edge Devices via Dynamic Expert Swapping
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-06-03 | DOI: 10.1109/TC.2025.3575905
Rui Kong;Yuanchun Li;Weijun Wang;Linghe Kong;Yunxin Liu
Mixture of experts (MoE) is a popular technique in deep learning that improves model capacity with conditionally-activated parallel neural network modules (experts). However, serving MoE models in resource-constrained, latency-critical edge scenarios is challenging due to the significantly increased model size and complexity. In this paper, we first analyze the behavior patterns of MoE models in continuous inference scenarios, which leads to three key observations about expert activations: temporal locality, exchangeability, and skippable computation. Based on these observations, we introduce PC-MoE, an inference framework for resource-constrained continuous MoE model serving. The core of PC-MoE is a new data structure, Parameter Committee, that intelligently maintains a subset of important experts in use to reduce resource consumption. To evaluate the effectiveness of PC-MoE, we conduct experiments using state-of-the-art MoE models on common computer vision and natural language processing tasks. The results demonstrate that PC-MoE achieves favorable trade-offs between resource consumption and model accuracy. For instance, on object detection tasks with the Swin-MoE model, our approach reduces memory usage and latency by 42.34% and 18.63%, respectively, with only 0.10% accuracy degradation.
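As a rough illustration of the Parameter Committee idea described above — keeping only a small, frequently used subset of experts resident and swapping others in on demand — here is a minimal Python sketch. The class name, capacity value, frequency-based eviction rule, and activation trace are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

class ExpertCommittee:
    """Minimal sketch of a 'committee' of resident experts.

    Keeps at most `capacity` experts loaded; on a miss, the least
    frequently used resident expert is evicted and the requested
    expert is (notionally) swapped in from slower storage.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = set()            # experts currently in fast memory
        self.use_count = defaultdict(int)
        self.swaps = 0                   # number of expert load operations

    def activate(self, expert_id):
        self.use_count[expert_id] += 1
        if expert_id not in self.resident:
            if len(self.resident) >= self.capacity:
                victim = min(self.resident, key=lambda e: self.use_count[e])
                self.resident.remove(victim)
            self.resident.add(expert_id)  # stands in for loading the weights
            self.swaps += 1

# Temporal locality in expert routing keeps the swap count low:
committee = ExpertCommittee(capacity=4)
trace = [0, 1, 0, 2, 1, 0, 3, 0, 1, 2, 5, 0, 1]  # hypothetical expert activations
for e in trace:
    committee.activate(e)
print("swaps:", committee.swaps, "resident:", sorted(committee.resident))
```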
Citations: 0
Dual Fast-Track Cache: Organizing Ring-Shaped Racetracks to Work as L1 Caches
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-06-03 | DOI: 10.1109/TC.2025.3575909
Alejandro Valero;Vicente Lorente;Salvador Petit;Julio Sahuquillo
Static Random-Access Memory (SRAM) is the fastest memory technology and has been the common design choice for implementing first-level (L1) caches in the processor pipeline, where speed is a key design requirement. However, SRAM offers much lower density than other technologies such as Dynamic RAM, limiting the L1 cache sizes of modern processors to a few tens of KB. This paper explores the use of slower but denser Domain Wall Memory (DWM) technology for L1 caches. DWM exhibits slow access times because it arranges multiple bits sequentially in a magnetic racetrack; to access a given bit, the track must be shifted to place that bit under a header. A 1-bit shift usually takes one processor cycle, which can significantly hurt application performance, making this working behavior inappropriate for L1 caches. Based on the locality (temporal and spatial) principles exploited by caches, this work proposes the Dual Fast-Track Cache (Dual FTC) design, a new approach to organizing a set of racetracks to build set-associative caches. Compared to a conventional SRAM cache, Dual FTC enhances storage capacity by 5× while incurring minimal shifting overhead, rendering it a practical and appealing solution for L1 cache implementations. Experimental results show that the devised cache organization is as fast as an SRAM cache for 78% of L1 data cache hits and 86% of L1 instruction cache hits (i.e., no shift is required). Consequently, due to the larger L1 cache capacities, significant system performance gains (22% on average) are obtained under the same silicon area.
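The shifting overhead that the abstract describes can be pictured with a toy cost model for a ring-shaped track: the number of one-bit shifts needed to bring a stored bit under the access header depends on how far it sits from the header's current position. The ring length, position model, and access pattern below are illustrative assumptions, not the Dual FTC organization itself.

```python
def shifts_to_access(track_len, header_pos, target_pos):
    """Minimal shifts needed on a ring-shaped racetrack to align the
    target bit under the access header (the ring can rotate either way)."""
    d = (target_pos - header_pos) % track_len
    return min(d, track_len - d)

# Sequential accesses benefit from locality (one shift each), while a
# distant bit on a 64-bit ring can require up to track_len/2 shifts.
track_len = 64
sequential = sum(shifts_to_access(track_len, p, p + 1) for p in range(8))
far_access = shifts_to_access(track_len, 0, 32)
print("8 sequential accesses:", sequential, "shifts; one far access:", far_access, "shifts")
```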
Citations: 0
Ls-Stream: Lightening Stragglers in Join Operators for Skewed Data Stream Processing
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-06-03 | DOI: 10.1109/TC.2025.3575917
Minghui Wu;Dawei Sun;Shang Gao;Keqin Li;Rajkumar Buyya
Load imbalance can lead to the emergence of stragglers, i.e., join instances that significantly lag behind others in processing data streams. Currently, state-of-the-art solutions are capable of balancing the load between join instances to mitigate stragglers by managing hot keys and random partitioning. However, these solutions rely on either complicated routing strategies or resource-inefficient processing structures, making them susceptible to frequent changes in load between instances. Therefore, we present Ls-Stream, a data stream scheduler that aims to support dynamic workload assignment for join instances to lighten stragglers. This paper outlines our solution from the following aspects: (1) The models for partitioning, communication, matrix, and resource are developed, formalizing problems like imbalanced load between join instances and state migration costs. (2) Ls-Stream employs a two-level routing strategy for workload allocation by combining hash-based and key-based data partitioning, specifying the destination join instances for data tuples. (3) Ls-Stream also constructs a fine-grained model for minimizing the state migration cost. This allows us to make trade-offs between data transfer overhead and migration benefits. (4) Experimental results demonstrate significant improvements made by Ls-Stream: reducing maximum system latency by 49.3% and increasing maximum throughput by more than 2x compared to existing state-of-the-art works.
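The two-level routing idea — scattering tuples of hot keys across instances while hash-partitioning the long tail — can be sketched as follows. The hot-key set, instance count, and scattering policy are assumptions for illustration only; in a real stream join, matching tuples from the opposite stream would additionally need to be replicated to the candidate instances.

```python
import hashlib
import random

NUM_INSTANCES = 8
HOT_KEYS = {"user_42", "item_7"}   # hypothetical keys identified as hot

def route(key):
    """Two-level routing: hot keys are spread over all instances
    (key-based splitting), cold keys use plain hash partitioning."""
    if key in HOT_KEYS:
        return random.randrange(NUM_INSTANCES)        # scatter hot-key tuples
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_INSTANCES            # stable hash routing

counts = [0] * NUM_INSTANCES
for i in range(10_000):
    key = "user_42" if i % 2 == 0 else f"key_{i % 50}"  # skewed workload
    counts[route(key)] += 1
print("tuples per instance:", counts)
```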
Citations: 0
Fast Garbage Collection in Erasure-Coded Storage Clusters
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-06-03 | DOI: 10.1109/TC.2025.3575914
Hai Zhou;Dan Feng;Yuchong Hu;Wei Wang;Huadong Huang
Erasure codes (EC) have been widely adopted to provide high data reliability with low storage costs in clusters. Due to deletion and out-of-place update operations, some data blocks become invalid, which gives rise to the tedious garbage collection (GC) problem. Several limitations still plague existing designs: substantial network traffic, unbalanced traffic load, and low read/write performance after GC. This paper proposes FastGC, a fast garbage collection method that merges old stripes into a new stripe and reclaims invalid blocks. FastGC quickly generates an efficient merge solution through stripe grouping and bit-sequence operations to minimize network traffic, and maintains the data block distribution of each stripe to ensure read performance. It carefully allocates the storage space for new stripes during merging to eliminate the discontinuous free spaces that hurt write performance. Furthermore, to accelerate parity updates after merging, FastGC greedily schedules the transmission links for multi-stripe updates to balance the traffic load across nodes and adopts a maximum-flow algorithm to saturate bandwidth utilization. Comprehensive evaluations via simulations and Alibaba ECS experiments show that FastGC reduces network traffic by 10.36%-81.22% and GC time by 34.25%-72.36% while maintaining read/write performance after GC.
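The stripe-merging step can be pictured with per-stripe bitmaps of valid data-block slots: two stripes whose valid blocks jointly fit within one stripe's k slots are candidates for merging, and disjoint slot positions avoid relocation. The bitmap encoding and pairing check below are illustrative assumptions, not FastGC's actual grouping algorithm.

```python
K = 6  # data blocks per stripe (k in an RS(k, m) code), chosen for illustration

def valid_count(bitmap):
    return bin(bitmap).count("1")

def can_merge(a, b):
    """Two stripes can merge into one if their valid blocks together
    fit into K slots; disjoint slot positions need no relocation."""
    return valid_count(a) + valid_count(b) <= K

# Hypothetical valid-block bitmaps after deletions/updates (bit i = slot i valid).
stripes = {1: 0b101001, 2: 0b010010, 3: 0b000110, 4: 0b111100}

pairs = [(i, j) for i in stripes for j in stripes
         if i < j and can_merge(stripes[i], stripes[j])]
print("mergeable stripe pairs:", pairs)
for i, j in pairs[:1]:
    conflict_free = (stripes[i] & stripes[j]) == 0
    print(f"merge {i}+{j}: slot conflict-free = {conflict_free}")
```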
Citations: 0
DCAS-BMT: Dynamic Construction and Adjustment of Skewed Bonsai Merkle Tree for Performance Enhancement in Secure Non-Volatile Memory
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-10 | DOI: 10.1109/TC.2025.3558007
Yu Zhang;Renhai Chen;Hangyu Yan;Hongyue Wu;Zhiyong Feng
Traditional DRAM-based memory solutions face challenges, including high energy consumption and limited scalability. Non-Volatile Memory (NVM) offers low energy consumption and high scalability. However, security challenges, particularly data remanence vulnerabilities, persist. Prevalent methods such as the Bonsai Merkle Tree (BMT) are employed to ensure data security. However, the consistency requirements for integrity tree updates have led to performance issues. It is observed that compared to a secure NVM system without persistent secure metadata, the average overhead for updating and persisting the BMT root with persistent secure metadata is as high as 2.48 times. Therefore, this paper aims to mitigate these inefficiencies by leveraging the principle of memory access locality. We propose the Dynamic Construction and Adjustment of Skewed Bonsai Merkle Tree (DCAS-BMT). The DCAS-BMT is dynamically built and continuously adjusted at runtime according to access weights, ensuring frequently accessed memory blocks reside on shorter paths to the root node. This reduces the verification steps for frequently accessed memory blocks, thereby lowering the overall cost of memory authentication and updates. Experimental results using the USIMM memory simulator demonstrate that compared to the widely used BMT approach, the DCAS-BMT scheme shows a performance improvement of 34.1%.
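The underlying intuition — blocks with higher access weight should sit on shorter paths to the root so that their verification touches fewer tree levels — is the same one behind weight-skewed constructions such as Huffman trees. The sketch below uses a Huffman-style build purely as an analogy; DCAS-BMT's actual construction and runtime adjustment differ and operate on a Bonsai Merkle Tree, and the block names and weights are hypothetical.

```python
import heapq

def skewed_depths(access_weights):
    """Huffman-style construction: leaves with higher access weight end
    up at smaller depth, i.e., on a shorter path to the root."""
    heap = [(w, [name]) for name, w in access_weights.items()]
    heapq.heapify(heap)
    depth = {name: 0 for name in access_weights}
    while len(heap) > 1:
        w1, g1 = heapq.heappop(heap)
        w2, g2 = heapq.heappop(heap)
        for name in g1 + g2:        # members of the merged subtrees gain one level
            depth[name] += 1
        heapq.heappush(heap, (w1 + w2, g1 + g2))
    return depth

weights = {"blk_hot": 900, "blk_warm": 60, "blk_a": 20, "blk_b": 15, "blk_c": 5}
print(skewed_depths(weights))   # the hot block ends up with the smallest depth
```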
Citations: 0
DCGG: A Dynamically Adaptive and Hardware-Software Coordinated Runtime System for GNN Acceleration on GPUs
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-09 | DOI: 10.1109/TC.2025.3558042
Guoqing Xiao;Li Xia;Yuedan Chen;Hongyang Chen;Wangdong Yang
Graph neural networks (GNNs) are a prominent trend in graph-based deep learning, known for their capacity to produce high-quality node embeddings. However, existing GNN frameworks are designed only at the algorithm level and do not fully exploit the GPU hardware architecture. To this end, we propose DCGG, a dynamically adaptive runtime framework that can accelerate various GNN workloads on GPU platforms. DCGG performs deeper optimization, mainly in load balancing and hardware-software matching. Accordingly, three optimization strategies are proposed. First, we propose dynamic 2D workload management methods and perform customized optimization based on them, effectively reducing extra memory operations. Second, a new slicing strategy is adopted, combined with hardware features, to effectively improve the efficiency of data reuse. Third, DCGG uses a Quantitative Dimension Parallel Strategy to optimize dimensions and parallel methods, greatly improving load balance and data locality. Extensive experiments demonstrate that DCGG outperforms state-of-the-art GNN computing frameworks, such as Deep Graph Library (up to 3.10× faster) and GNNAdvisor (up to 2.80× faster), on mainstream GNN architectures across various datasets.
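One way to picture "2D workload management" for GNN aggregation is tiling the work along two axes, node groups and feature-dimension slices, and treating each tile as an independently schedulable unit (e.g., one GPU thread block per tile). The function name and tile sizes below are hypothetical; this is not DCGG's actual mapping.

```python
def tile_2d_workload(num_nodes, feat_dim, node_tile, dim_tile):
    """Split GNN aggregation work along two axes: groups of nodes and
    slices of the feature dimension. Each tile is an independent unit
    that could be assigned to a GPU thread block (here just listed)."""
    tiles = []
    for n0 in range(0, num_nodes, node_tile):
        for d0 in range(0, feat_dim, dim_tile):
            tiles.append(((n0, min(n0 + node_tile, num_nodes)),
                          (d0, min(d0 + dim_tile, feat_dim))))
    return tiles

# Hypothetical sizes: 1000 nodes, 64-dim features, 256x16 tiles.
tiles = tile_2d_workload(1000, 64, node_tile=256, dim_tile=16)
print(len(tiles), "tiles; first:", tiles[0], "last:", tiles[-1])
```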
Citations: 0
SmartZone: Runtime Support for Secure and Efficient On-Device Inference on ARM TrustZone
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-08 | DOI: 10.1109/TC.2025.3557971
Zhaolong Jian;Xu Liu;Qiankun Dong;Longkai Cheng;Xueshuo Xie;Tao Li
On-device inference is a burgeoning paradigm that performs model inference locally on end devices, allowing private data to remain local. ARM TrustZone, as a widely supported trusted execution environment, has been applied to provide confidentiality protection for on-device inference. However, with the rise of large-scale models such as large language models (LLMs), TrustZone-based on-device inference faces challenges of difficult migration and inefficient execution. The rudimentary TEE OS on TrustZone lacks both the inference runtime needed for building models and the parallel support necessary to accelerate inference. Moreover, the limited secure memory resources on end devices further constrain the model size and degrade performance. In this paper, we propose SmartZone to provide runtime support for secure and efficient on-device inference on TrustZone. SmartZone consists of three main components: (1) a trusted inference-oriented operator set, providing the underlying mechanisms, adapted to TrustZone's execution mode, for trusted inference of DNN models and LLMs; (2) proactive multi-threading parallel support, which increases the number of CPU cores in the secure state via cross-world thread collaboration to achieve parallelism; and (3) an on-demand secure memory management method, which statically allocates the appropriate secure memory size based on pre-execution resource analysis. We implement a prototype of SmartZone on the Raspberry Pi 3B+ board and evaluate it on four well-known DNN models and the Llama 2 LLM. Extensive experimental results show that SmartZone provides end-to-end protection for on-device inference while maintaining excellent performance. Compared to the original trusted inference, SmartZone accelerates inference by up to 4.26× and reduces energy consumption by 65.81%.
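The "pre-execution resource analysis, then static allocation" idea in component (3) can be illustrated by walking a model's layer list once and reserving a secure buffer sized to the peak live-tensor footprint. The layer shapes, data-type size, and footprint formula below are illustrative assumptions, not SmartZone's actual analysis.

```python
def peak_activation_bytes(layer_shapes, dtype_bytes=4):
    """Walk the layer list once before execution and return the peak
    footprint of an input/output tensor pair, as a stand-in for the
    'analyze first, then allocate a fixed secure buffer once' idea."""
    peak = 0
    for in_elems, out_elems in layer_shapes:
        live = (in_elems + out_elems) * dtype_bytes
        peak = max(peak, live)
    return peak

# Hypothetical per-layer (input elements, output elements) counts.
layers = [(3 * 224 * 224, 64 * 112 * 112), (64 * 112 * 112, 64 * 56 * 56),
          (64 * 56 * 56, 128 * 28 * 28), (128 * 28 * 28, 1000)]
budget = peak_activation_bytes(layers)
print(f"secure buffer to reserve up front: {budget / 2**20:.1f} MiB")
```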
Citations: 0
Accelerating RNA-Seq Quantification on a Real Processing-in-Memory System
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-08 | DOI: 10.1109/TC.2025.3558075
Liang-Chi Chen;Chien-Chung Ho;Yuan-Hao Chang
Recently, with the growth of the data sizes required by emerging applications (e.g., graph processing and machine learning), the von Neumann bottleneck has become a main problem restricting application throughput. To address this problem, an acceleration technique called Processing in Memory (PIM) has garnered attention due to its potential to reduce off-chip data movement between the processing unit (e.g., CPU) and the memory device (e.g., DRAM). In 2019, UPMEM introduced a commercially available processing-in-memory product, the DRAM Processing Unit (DPU) [8], opening a new opportunity for accelerating data-intensive applications. Among data-intensive applications, RNA sequence (RNA-seq) quantification is used to measure the abundance of RNA sequences and plays a critical role in bioinformatics. We aim to leverage the UPMEM DPU to accelerate RNA-seq quantification. However, due to the usage limitations imposed by the DPU hardware, realizing RNA-seq quantification on the DPU system poses several challenges. To overcome these challenges, we propose UpPipe, which consists of DPU-friendly transcriptome allocation, DPU-aware pipeline management, and a WRAM prefetching scheme. UpPipe accounts for the hardware limitations of DPUs, enabling efficient sequence alignment even within the resource-constrained DPUs. The experimental results demonstrate the feasibility and efficiency of our proposed design. We also provide an evaluation study on the impact of data granularity selection on pipeline management and the optimal size for the WRAM prefetching scheme.
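A DPU-friendly transcriptome allocation has to keep per-DPU work balanced; a simple way to picture this is greedy longest-first placement of transcripts onto the currently least-loaded DPU. The transcript lengths and DPU count below are hypothetical, and the real UpPipe allocation policy may differ.

```python
import heapq

def allocate_transcripts(transcript_lengths, num_dpus):
    """Greedy longest-first placement: always give the next transcript
    to the currently least-loaded DPU so per-DPU work stays balanced."""
    heap = [(0, dpu) for dpu in range(num_dpus)]   # (assigned bases, DPU id)
    heapq.heapify(heap)
    assignment = {dpu: [] for dpu in range(num_dpus)}
    for tid, length in sorted(transcript_lengths.items(), key=lambda x: -x[1]):
        load, dpu = heapq.heappop(heap)
        assignment[dpu].append(tid)
        heapq.heappush(heap, (load + length, dpu))
    return assignment, sorted(load for load, _ in heap)

# Hypothetical transcript lengths (in bases) spread over 4 DPUs.
lengths = {f"tx{i}": l for i, l in
           enumerate([9000, 4000, 3500, 3000, 1200, 800, 600, 400])}
assignment, loads = allocate_transcripts(lengths, num_dpus=4)
print("per-DPU total bases:", loads)
print(assignment)
```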
Citations: 0
Accelerating Loss Recovery for Content Delivery Network
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-08 | DOI: 10.1109/TC.2025.3558020
Tong Li;Wei Liu;Xinyu Ma;Shuaipeng Zhu;Jingkun Cao;Duling Xu;Zhaoqi Yang;Senzhen Liu;Taotao Zhang;Yinfeng Zhu;Bo Wu;Kezhi Wang;Ke Xu
Packet losses significantly impact the user experience of content delivery network (CDN) services such as live streaming and data backup-and-archiving. However, our production network measurement studies show that the legacy loss recovery is far from satisfactory due to the wide-area loss characteristics (i.e., dynamics and burstiness) in the wild. In this paper, we propose a sender-side Adaptive ReTransmission scheme, ART, which minimizes the recovery time of lost packets with minimal redundancy cost. Distinguishing itself from forward-error-correction (FEC), which preemptively sends redundant data packets to prevent loss, ART functions as an automatic-repeat-request (ARQ) scheme. It applies redundancy specifically to lost packets instead of unlost packets, thereby addressing the characteristic patterns of wide-area losses in real-world scenarios. We implement ART upon QUIC protocol and evaluate it via both trace-driven emulation and real-world deployment. The results show that ART reduces up to 34% of flow completion time (FCT) for delay-sensitive transmissions, improves up to 26% of goodput for throughput-intensive transmissions, reduces 11.6% video playback rebuffering, and saves up to 90% of redundancy cost.
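The core of ART, as described above, is ARQ-style redundancy applied only to packets that were actually lost. A minimal sketch of that idea is a sender that tracks the recent loss rate and retransmits a lost packet with more copies when losses are frequent or bursty. The window size, thresholds, and copy limit below are made-up parameters, not ART's algorithm.

```python
from collections import deque

class AdaptiveRetransmitter:
    """Sketch of ARQ-style redundancy applied only to *lost* packets:
    the number of retransmitted copies grows with the recent loss rate,
    so lossy paths get extra protection without padding normal traffic."""

    def __init__(self, window=100, max_copies=3):
        self.history = deque(maxlen=window)   # 1 = lost, 0 = delivered
        self.max_copies = max_copies

    def record(self, lost):
        self.history.append(1 if lost else 0)

    def copies_for_retransmission(self):
        loss_rate = sum(self.history) / max(len(self.history), 1)
        # e.g., <2% loss -> 1 copy, moderate loss -> 2, heavy loss -> 3
        return min(self.max_copies, 1 + int(loss_rate / 0.02))

art = AdaptiveRetransmitter()
for i in range(100):
    art.record(lost=(i % 20 == 0))            # ~5% loss in the recent window
print("copies per lost packet:", art.copies_for_retransmission())
```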
Citations: 0
CIMUS: 3D-Stacked Computing-in-Memory Under Image Sensor Architecture for Efficient Machine Vision
IF 3.6 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-08 | DOI: 10.1109/TC.2025.3558068
Lixia Han;Yiyang Chen;Siyuan Chen;Haozhang Yang;Ao Shi;Guihai Yu;Jiaqi Li;Zheng Zhou;Yijiao Wang;Yanzhi Wang;Xiaoyan Liu;Jinfeng Kang;Peng Huang
Computational image sensors with CNN processing capabilities are emerging to alleviate the energy-intensive and time-consuming data movement between sensors and external processors. However, deploying CNN models onto these computational image sensors faces challenges from limited on-chip memory resources and insufficient image processing throughput. This work proposes a 3D-stacked NAND flash-based computing-in-memory under image sensor architecture (CIMUS) to facilitate the complete deployment of CNN models. To fully leverage the high-bandwidth potential of 3D-stacked integration, we design a novel distributed CNN mapping and dataflow to process the full focal-plane image in parallel, which senses and recognizes ImageNet tasks at >1000 fps. To tackle the computation error caused by zero-valued inputs in 3D NAND flash-based CIM, we propose an input-independent offset compensation method, which reduces the average vector-matrix multiplication (VMM) error by 48%. Evaluation results indicate that the CIMUS architecture achieves a 9.8× improvement in CNN inference speed and a 33× boost in energy efficiency compared to the state-of-the-art computational image sensor on the ImageNet recognition task.
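The input-independent offset compensation can be illustrated with a toy analog VMM in which every cell leaks a small, input-independent current: calibrating once with an all-zero input vector measures that offset, and subtracting it corrects subsequent outputs. In this toy model the leak is the only error source, so compensation removes it entirely; the array size and leak model are assumptions, not CIMUS's actual device behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 16))   # hypothetical 64x16 weight array

def analog_vmm(x, leak_per_cell=0.05):
    """Toy analog VMM: every cell leaks a small current even when its
    input is 0, shifting the ideal dot-product result by a fixed offset."""
    leak = leak_per_cell * weights.sum(axis=0)   # input-independent term
    return x @ weights + leak

x = rng.random(64)
ideal = x @ weights
raw = analog_vmm(x)
offset = analog_vmm(np.zeros(64))    # calibrate once with an all-zero input
compensated = raw - offset

print("mean |error| before:", np.abs(raw - ideal).mean())
print("mean |error| after: ", np.abs(compensated - ideal).mean())
```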
Citations: 0