I/O virtualization is used by cloud platforms to provide tenants with efficient, scalable, and manageable network and storage services. The de-facto industry standard, paravirtualization, offers rich cloud functionality by introducing split front-end and back-end drivers in the guest and host operating systems, respectively. However, paravirtualization incurs host inefficiency and performance overhead. Emerging hardware virtio accelerators (i.e., SRIOV-capable devices that conform to the virtio specification), used with device passthrough technologies, mitigate this performance issue, but adopting them presents the challenge of insufficient support for live migration. This paper proposes Un-IOV, a novel I/O virtualization system that simultaneously achieves bare-metal-level I/O performance and migratability. The key idea is a new hybrid virtualization stack with: (1) a host-bypassed direct data path for virtio accelerators, and (2) a relayed control path that guarantees seamless live-migration support. Un-IOV achieves high scalability by consuming minimal host resources. Extensive experimental results demonstrate that Un-IOV delivers higher network and storage virtualization performance than software implementations, with performance comparable to direct passthrough I/O virtualization, while requiring zero guest modification (i.e., guest transparency).
{"title":"Un-IOV: Achieving Bare-Metal Level I/O Virtualization Performance for Cloud Usage With Migratability, Scalability and Transparency","authors":"Zongpu Zhang;Chenbo Xia;Cunming Liang;Jian Li;Chen Yu;Tiwei Bie;Roberts Martin;Daly Dan;Xiao Wang;Yong Liu;Haibing Guan","doi":"10.1109/TC.2024.3375589","journal":"IEEE Transactions on Computers","volume":"73 7","pages":"1655-1668","PeriodicalIF":3.7,"publicationDate":"2024-03-14","publicationType":"Journal Article"}
Long-timescale Molecular Dynamics (MD) simulation of small molecules is crucial in drug design and basic science. Accelerating a small data set that is executed for a large number of iterations requires high efficiency. Recent work in this domain has demonstrated that, among COTS devices, only FPGA-centric clusters can scale beyond a few processors. The problem addressed here is that, as the number of on-chip processors has increased from fewer than 10 into the hundreds, previous intra-chip routing solutions are no longer viable. We find, however, that high efficiency can be maintained through various design innovations. These include replacing the previous broadcast networks with ring routing and then augmenting the rings with out-of-order and caching mechanisms; others are adding a level of hierarchical filtering and memory recycling. Two novel optimized architectures emerge, together with a number of variations; these are validated, analyzed, and evaluated. We find that, in the domain of interest, speed-ups over GPUs are achieved. The potential impact is that this system promises to be the basis for scalable long-timescale MD with commodity clusters.
{"title":"FPGA-Accelerated Range-Limited Molecular Dynamics","authors":"Chunshu Wu;Chen Yang;Sahan Bandara;Tong Geng;Anqi Guo;Pouya Haghi;Ang Li;Martin Herbordt","doi":"10.1109/TC.2024.3375613","journal":"IEEE Transactions on Computers","volume":"73 6","pages":"1544-1558","PeriodicalIF":3.7,"publicationDate":"2024-03-14","publicationType":"Journal Article"}
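The abstract's replacement of broadcast networks with ring routing can be illustrated with a toy model. The sketch below simulates a ring all-gather, a standard communication pattern in which every node forwards one data chunk per step to a single fixed neighbour, so after n-1 steps all n nodes hold all chunks with constant per-node fan-out (versus the n-1 fan-out of a broadcast). This is only an illustrative software analogue under assumed names; it is not the paper's FPGA implementation.

```python
def ring_allgather(n):
    """Simulate an n-node ring all-gather.

    Each node starts with one chunk (its own index) and, every step,
    forwards the chunk it most recently received to its fixed ring
    neighbour. After n-1 steps every node holds all n chunks.
    Returns (per-node chunk sets, number of steps taken).
    """
    received = [{i} for i in range(n)]   # chunks each node holds
    sending = list(range(n))             # chunk each node sends this step
    steps = 0
    for _ in range(n - 1):
        incoming = [0] * n
        for i in range(n):
            dest = (i + 1) % n           # single fixed ring link
            received[dest].add(sending[i])
            incoming[dest] = sending[i]  # forwarded onward next step
        sending = incoming
        steps += 1
    return received, steps
```

The key property mirrored from the abstract: completion takes n-1 steps, but each node drives only one link per step, which is what lets a ring scale where a broadcast fabric does not.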
Lulu Yao;Yongkun Li;Patrick P. C. Lee;Xiaoyang Wang;Yinlong Xu
Memory deduplication effectively relieves the memory space bottleneck by removing duplicate pages, especially in virtualized systems in which virtual machines run the same OS and similar applications. However, due to the non-uniform access latencies in NUMA architectures, memory deduplication poses a trade-off between memory savings and access performance: global deduplication across NUMA nodes realizes high memory savings, but leads to frequent cross-node remote access after deduplication and results in performance degradations. In contrast, local deduplication avoids remote access, but limits deduplication effectiveness. We design AdaptMD, an adaptive memory deduplication system that addresses the space-performance trade-off in NUMA architectures. AdaptMD leverages hotness awareness to globally deduplicate only cold pages to reduce remote access. It also migrates similar applications to the same NUMA node to allow local deduplication without remote access. We further make AdaptMD readily configurable to address various deployment scenarios. Experiments show that AdaptMD achieves high memory savings as in global deduplication, while achieving similar access performance as in local deduplication.
{"title":"AdaptMD: Balancing Space and Performance in NUMA Architectures With Adaptive Memory Deduplication","authors":"Lulu Yao;Yongkun Li;Patrick P. C. Lee;Xiaoyang Wang;Yinlong Xu","doi":"10.1109/TC.2024.3375592","journal":"IEEE Transactions on Computers","volume":"73 6","pages":"1588-1602","PeriodicalIF":3.7,"publicationDate":"2024-03-14","publicationType":"Journal Article"}
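The hotness-aware policy described in the abstract — deduplicate cold pages globally across NUMA nodes, but restrict hot pages to node-local deduplication so a frequently accessed page never ends up mapped to remote memory — can be sketched as a toy model. The page representation, threshold, and function names below are assumptions for illustration, not AdaptMD's actual code.

```python
import hashlib

def hotness_aware_dedup(pages, hot_threshold=8):
    """Toy hotness-aware deduplication.

    pages: list of (node_id, access_count, content) tuples.
    Cold pages (below the threshold) share one global namespace, so
    duplicates merge across NUMA nodes; hot pages are scoped per node,
    so a hot page only merges with a copy on its own node and never
    becomes a remote access. Returns (kept pages, merge count).
    """
    owner = {}            # dedup key -> node holding the canonical copy
    kept, merged = [], 0
    for node, hits, content in pages:
        digest = hashlib.sha256(content).hexdigest()
        if hits >= hot_threshold:
            key = (digest, node)   # hot: node-local scope only
        else:
            key = (digest,)        # cold: global scope
        if key in owner:
            merged += 1            # map this page to the existing copy
        else:
            owner[key] = node
            kept.append((node, content))
    return kept, merged
```

Under this model, identical cold pages on different nodes collapse to one copy (maximizing savings), while identical hot pages on different nodes stay separate (preserving local access latency) — the space-performance trade-off the paper targets.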
In this paper, we present a new method to find S-box circuits with optimal multiplicative complexity (MC), i.e., MC-optimal S-box circuits. We provide new observations for efficiently constructing circuits and computing MC, combined with the popular A* pathfinding algorithm. In our search, the A* algorithm outputs a path of length MC, corresponding to an MC-optimal circuit. Based on an in-depth analysis of the process of computing MC, we enable the A* algorithm to function within our graph to investigate a wider range of S-boxes than existing methods such as the SAT-solver-based tool [1]
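The role A* plays in this search can be seen from a generic skeleton. With uniform edge cost 1, the length of the optimal path returned by A* is the minimum number of steps from the start state to a goal state; in the paper's setting each step corresponds to adding one AND gate, so that length is the MC. The state encoding, neighbour expansion, and heuristic below are deliberately generic placeholders — the paper's are far more elaborate — and the toy usage in the comment is purely illustrative.

```python
import heapq

def astar(start, is_goal, neighbors, h):
    """Generic A* with unit edge costs.

    h must be an admissible heuristic (never overestimates remaining
    steps); then the returned value g is the length of a shortest path,
    or None if no goal state is reachable.
    """
    frontier = [(h(start), 0, start)]    # (f = g + h, g, state)
    best_g = {start: 0}
    while frontier:
        f, g, state = heapq.heappop(frontier)
        if is_goal(state):
            return g
        if g > best_g.get(state, float("inf")):
            continue                     # stale queue entry
        for nxt in neighbors(state):
            ng = g + 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None

# Toy usage: reach 10 from 0 by steps of +1 or +2; the admissible
# heuristic ceil((10 - x) / 2) gives the optimal path length 5.
```

In the MC search, a state would encode the set of functions computed so far, a neighbour transition would add one AND gate, and the heuristic would lower-bound the number of AND gates still required.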