
Latest Articles from IEEE Transactions on Computers

Efficient Sketching for Heavy Item-Oriented Data Stream Mining With Memory Constraints
IF 3.8 CAS Tier 2, Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-09-02 DOI: 10.1109/TC.2025.3604467
Weihe Li;Paul Patras
Accurate and fast data stream mining is critical to many tasks, including real-time series analysis for mobile sensor data, big data management and machine learning. Various heavy-oriented item detection tasks, such as identifying heavy hitters, heavy changers, persistent items, and significant items, have garnered considerable attention from both industry and academia. Unfortunately, as data stream speeds continue to increase and the available memory, particularly in L1 cache, remains limited for real-time processing, existing schemes face challenges in simultaneously achieving high detection accuracy, memory efficiency, and fast update throughput, as we reveal. To tackle this conundrum, we propose a versatile and elegant sketch framework named Tight-Sketch, which supports a spectrum of heavy-based detection tasks. Recognizing that, in practice, most items are cold (non-heavy/persistent/significant), we implement distinct eviction strategies for different item types. This approach allows us to swiftly discard potentially cold items while offering enhanced protection to hot ones (heavy/persistent/significant). Additionally, we introduce an eviction method based on stochastic decay, ensuring that Tight-Sketch incurs only small one-sided errors without overestimation. To further enhance detection accuracy under extremely constrained memory allocations, we introduce Tight-Opt, a variant incorporating two optimization strategies. We conduct extensive experiments across various detection tasks to demonstrate that Tight-Sketch significantly outperforms existing methods in terms of both accuracy and update speed. Furthermore, by utilizing Single Instruction Multiple Data (SIMD) instructions, we enhance Tight-Sketch’s update throughput by up to 36%. We also implement Tight-Sketch on FPGA to validate its practicality and low resource overhead in hardware deployments.
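The eviction idea in this abstract can be sketched in a few lines. The following toy sketch is illustrative only: the one-slot buckets and the decay rule (decrement the incumbent with probability 1/(count+1)) are assumptions for exposition, not Tight-Sketch's exact design, but they show why stochastic decay protects hot items while yielding only one-sided (under-)estimation errors.

```python
import random

class DecayBucketSketch:
    """Toy one-slot-per-bucket sketch with stochastic-decay eviction.

    Illustrative assumption: a cold key colliding with a hot incumbent
    decrements the incumbent only with probability 1/(count+1), so hot
    items are rarely dislodged and counts are never overestimated.
    """
    def __init__(self, num_buckets=1024, seed=0):
        self.buckets = [None] * num_buckets  # each slot: [key, count] or None
        self.rng = random.Random(seed)

    def update(self, key):
        b = hash(key) % len(self.buckets)
        slot = self.buckets[b]
        if slot is None:
            self.buckets[b] = [key, 1]          # claim an empty slot
        elif slot[0] == key:
            slot[1] += 1                        # hot path: simple increment
        else:
            # Stochastic decay: the hotter the incumbent, the harder to evict.
            if self.rng.random() < 1.0 / (slot[1] + 1):
                slot[1] -= 1
                if slot[1] == 0:
                    self.buckets[b] = [key, 1]  # evict and take over

    def estimate(self, key):
        b = hash(key) % len(self.buckets)
        slot = self.buckets[b]
        # One-sided error: a tracked key's count may have decayed, never grown.
        return slot[1] if slot and slot[0] == key else 0
```

Because decay only ever decrements a tracked counter, estimates are biased downward, matching the "small one-sided errors without overestimation" property claimed in the abstract.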
IEEE Transactions on Computers, vol. 74, no. 11, pp. 3845-3859.
Citations: 0
Load Balancing Scheduling for Batch-Ordered Job-Store: Online vs. Offline
IF 3.8 CAS Tier 2, Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-09-02 DOI: 10.1109/TC.2025.3603725
Mengbing Zhou;Yang Wang;Bocong Zhao;Chengzhong Xu
Efficient resource utilization is crucial in real-world applications, especially for balancing loads across machines handling specific job types. This paper introduces a novel batch-ordered job-store scheduling model, where jobs in a batch are scheduled sequentially, with their operations allocated in a round-robin fashion across two scenarios. We establish that this problem is NP-hard and analyze it in both online and offline settings. In the online case, we first examine the exclusive scenario, where operations within the same job must be scheduled on different machines, and show that a load greedy (LG) algorithm achieves a tight competitive ratio of $2-\frac{1}{m}$, with $m$ representing the number of machines. Next, we consider the circular scenario, which requires maintaining the circular order of operations across ordered machines. In this context, we analyze potential anomalies in load distribution during local optimality achieved by the ordered load greedy (OLG) algorithm and provide bounds on the occurrence of these anomalies and the maximum load in each local scheduling round. In the offline case, we abstract each OLG scheduling process as a generalized circular sequence alignment (CSA) problem and develop a dynamic programming-based matching (DPM) algorithm to solve it. To further enhance load balancing, we develop a dynamic programming-based optimization (DPO) algorithm to schedule multiple jobs simultaneously in both scenarios. Experimental results confirm the efficiency of DPM for the CSA problem, and we validate the load balancing effectiveness of both online and offline algorithms using real traffic datasets. These theoretical findings and algorithmic implementations lay a solid groundwork for future practical advancements.
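The $2-\frac{1}{m}$ ratio is the same tight bound that classical list scheduling attains in plain load balancing, which the following sketch illustrates. This is the textbook least-loaded greedy, not the paper's LG algorithm in the batch-ordered job-store model; the point is only to show the style of greedy assignment that achieves this ratio.

```python
import heapq

def greedy_load_balance(jobs, m):
    """Assign each job to the currently least-loaded of m machines.

    Classic list-scheduling sketch: in standard load balancing this greedy
    is (2 - 1/m)-competitive against the optimal makespan.
    """
    heap = [(0.0, i) for i in range(m)]  # (current load, machine id)
    heapq.heapify(heap)
    assignment = []
    for size in jobs:
        load, i = heapq.heappop(heap)    # least-loaded machine
        assignment.append(i)
        heapq.heappush(heap, (load + size, i))
    makespan = max(load for load, _ in heap)
    return assignment, makespan
```

For example, jobs [3, 3, 2, 2, 2] on m = 2 machines give greedy makespan 7 versus optimal 6, within the (2 - 1/2) = 1.5 factor.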
IEEE Transactions on Computers, vol. 74, no. 11, pp. 3778-3791.
Citations: 0
Multi-Path Bound for Parallel Tasks With Conditional Branches
IF 3.8 CAS Tier 2, Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-09-02 DOI: 10.1109/TC.2025.3604469
Qingqiang He;Nan Guan;Zhe Jiang;Mingsong Lv
Parallel execution and conditional execution are increasingly prevalent in modern embedded systems. In real-time scheduling, a fundamental problem is how to upper-bound the response times of a task. Recent work applied the multi-path technique to reduce the response time bound for tasks with parallel execution, but left tasks with conditional execution as an open problem. This paper focuses on upper-bounding response times for tasks with both parallel execution and conditional execution using the multi-path technique. By designing a delicate abstraction regarding the multiple paths of various conditional branches, we derive a new response time bound. We further apply this response time bound into the scheduling of multiple parallel tasks with conditional branches. Experiments demonstrate that the proposed bound significantly advances the state-of-the-art, reducing the response time bound by 9.4% and improving the schedulability by 31.2% on average.
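For context, the baseline that multi-path analyses improve upon is the classical single-path (Graham-style) response-time bound for a parallel DAG task. The sketch below computes that classical bound; it is background material, not the paper's new multi-path bound for conditional branches.

```python
def graham_response_bound(volume, longest_path, m):
    """Classical single-path response-time bound for a parallel (DAG) task
    on m identical cores:

        R <= L + (C - L) / m

    where C (volume) is the total work and L the longest path. Multi-path
    techniques, such as the one in this paper, tighten bounds of this style,
    here extended to tasks with conditional branches.
    """
    return longest_path + (volume - longest_path) / m
```

For a task with total work C = 100, critical path L = 20, on m = 4 cores, the classical bound gives R <= 20 + 80/4 = 40.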
IEEE Transactions on Computers, vol. 74, no. 11, pp. 3873-3887.
Citations: 0
Efficient Conjunctive Geometric Range Query Over Encrypted Spatial Data With Learned Index
IF 3.8 CAS Tier 2, Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-09-02 DOI: 10.1109/TC.2025.3604470
Mingyue Li;Chunfu Jia;Ruizhong Du;Guanxiong Ha
With the increasing popularity of geo-positioning technologies and mobile Internet, spatial data query services have attracted extensive attention. To protect the confidentiality of sensitive information outsourced to cloud servers, much effort has been devoted to designing geometric range query schemes over encrypted spatial data without affecting availability. However, existing works focus on privacy-preserving schemes with traditional tree indexes, incurring additional computing and storage overheads. In this paper, we propose an efficient conjunctive geometric range query scheme over encrypted spatial data with a learned index. In particular, we design a new privacy-preserving learned index for spatial data to reduce the search space and storage overhead. The main idea is to add noise disturbance to the objective function instead of directly adding it to the output results, reducing the leakage of private information and ensuring the correctness of output results. Moreover, we propose a spatial segmentation algorithm to avoid accessing a large number of unnecessary Z codes in the query process. The formal security analysis shows that our scheme ensures index data security and query privacy. Simulation results show that the query efficiency is improved while the storage overhead is significantly reduced compared with the state-of-the-art schemes.
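The learned-index idea underlying this scheme can be sketched minimally: fit a model of position as a function of key over a sorted array, record the maximum prediction error, then answer lookups with a bounded local search around the prediction. This sketch uses a plain least-squares line and omits the paper's privacy mechanism (noise added to the objective function), so it illustrates only the indexing side.

```python
def build_learned_index(sorted_keys):
    """Minimal learned-index sketch: position ~ slope * key + intercept,
    with lookups restricted to a window of +/- max prediction error."""
    n = len(sorted_keys)
    mean_k = sum(sorted_keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in sorted_keys)
    slope = sum((k - mean_k) * (i - mean_p)
                for i, k in enumerate(sorted_keys)) / var
    intercept = mean_p - slope * mean_k
    # Worst-case deviation of the model from true positions.
    err = max(abs(i - (slope * k + intercept))
              for i, k in enumerate(sorted_keys))

    def lookup(key):
        guess = int(round(slope * key + intercept))
        lo = max(0, guess - int(err) - 1)
        hi = min(n, guess + int(err) + 2)
        for i in range(lo, hi):  # bounded search inside the error window
            if sorted_keys[i] == key:
                return i
        return -1

    return lookup
```

The error bound `err` is what makes the search space small: instead of a full binary search over n keys, each lookup scans only a window proportional to the model's worst-case error.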
IEEE Transactions on Computers, vol. 74, no. 12, pp. 3995-4009.
Citations: 0
The Metric Relationship Between Extra Connectivity and Extra Diagnosability of Multiprocessor Systems
IF 3.8 CAS Tier 2, Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-09-02 DOI: 10.1109/TC.2025.3604468
Yifan Li;Shuming Zhou;Sun-Yuan Hsieh;Qifan Zhang
As multiprocessor systems scale up, $h$-extra connectivity and $h$-extra diagnosability serve as two pivotal metrics for assessing the reliability of the underlying interconnection networks. To ensure that each component of the survival graph holds no fewer than $h + 1$ vertices, the $h$-extra connectivity and $h$-extra diagnosability have been proposed to characterize the fault tolerability and self-diagnosing capability of networks, respectively. Many efforts have been made to establish the quantifiable relationship between these metrics but it is less than optimal. This work addresses the flaws of the existing results and proposes a novel proof to determine the metric relationship between $h$-extra connectivity and $h$-extra diagnosability under the PMC and MM* models. Our approach overcomes the defect of previous results by abandoning the network’s regularity and independence number. Furthermore, we apply the suggested metric to establish the $h$-extra diagnosability of a new network class, named generalized exchanged X-cube-like network $GEXC(s,t)$, which takes dual-cube-like network, generalized exchanged hypercube, generalized exchanged crossed cube, and locally generalized exchanged twisted cube as special cases. Finally, we propose the $h$-extra diagnosis strategy ($h$-EDS) and design two self-diagnosis algorithms AhED-PMC and AhED-MM*, and then conduct experiments on $GEXC(s,t)$ and the real-world network DD-$g648$ to show the high accuracy and superior performance of the proposed algorithms.
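The defining property of an $h$-extra vertex cut can be checked directly from its definition: removing the fault set must disconnect the graph while leaving every surviving component with at least $h + 1$ vertices. The brute-force check below illustrates that definition only; the paper's contribution is the analytic relationship between this quantity and extra diagnosability, not such enumeration.

```python
from collections import deque

def is_h_extra_cut(adj, fault_set, h):
    """Check whether fault_set is an h-extra vertex cut of the graph given
    as an adjacency dict: the survivors must split into >= 2 components,
    each of size >= h + 1."""
    alive = set(adj) - set(fault_set)
    seen, sizes = set(), []
    for s in alive:
        if s in seen:
            continue
        # BFS to measure the component containing s.
        q, comp = deque([s]), 0
        seen.add(s)
        while q:
            u = q.popleft()
            comp += 1
            for v in adj[u]:
                if v in alive and v not in seen:
                    seen.add(v)
                    q.append(v)
        sizes.append(comp)
    return len(sizes) >= 2 and min(sizes) >= h + 1
```

On a 6-cycle, removing two opposite vertices leaves two paths of two vertices each, so the cut is 1-extra but not 2-extra.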
IEEE Transactions on Computers, vol. 74, no. 11, pp. 3860-3872.
Citations: 0
Low-Power Multiplier Designs by Leveraging Correlations of 2×2 Encoded Partial Products
IF 3.8 CAS Tier 2, Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-09-02 DOI: 10.1109/TC.2025.3604478
Ao Liu;Siting Liu;Hui Wang;Qin Wang;Fabrizio Lombardi;Zhigang Mao;Honglan Jiang
Multipliers, particularly those with small bit widths, are essential for modern neural network (NN) applications. In addition, multiple-precision multipliers are in high demand for efficient NN accelerators; therefore, recursive multipliers used in low-precision fusion schemes are gaining increasing attention. In this work, we design exact recursive multipliers based on customized approximate full adders (AFAs) for low-power purposes. Initially, the partial products (PPs) encoded by 2×2 multiplications are analyzed, which reveals the correlations among adjacent PPs. Based on these correlations, we propose 4×4 recursive multiplier architectures where certain full adders (FAs) can be simplified without affecting the correctness of the multiplication. Manual and synthesis-tool-based FA simplifications are performed separately. The obtained 4×4 multipliers are then used to construct 8×8 multipliers based on a low-power recursive architecture. Finally, the proposed signed and unsigned 4×4 and 8×8 multipliers are evaluated using a 28nm CMOS technology. Compared with DesignWare (DW) multipliers, the proposed signed and unsigned 4×4 multipliers achieve power reductions of 16.5% and 11.6%, respectively, without compromising area or delay; alternatively, the delay can be reduced by 20.9% and 39.4%, respectively, without compromising power or area. For signed and unsigned 8×8 multipliers, the maximum power reductions are 9.7% and 13.7%, respectively, albeit with a trade-off in area.
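The recursive decomposition that such multipliers build on is easy to verify in software: a 4-bit operand splits into 2-bit halves, and the product is assembled from four 2×2 sub-products. The sketch below checks only the arithmetic identity; the paper's actual contribution, simplifying full adders by exploiting correlations among the encoded partial products, is a hardware-level optimization not modeled here.

```python
def mul2(a, b):
    """Exact 2-bit x 2-bit multiply: the base partial-product unit."""
    assert 0 <= a < 4 and 0 <= b < 4
    return a * b

def mul4_recursive(a, b):
    """4x4 multiply from four 2x2 sub-products.

    With a = aH*4 + aL and b = bH*4 + bL:
        a*b = (aH*bH << 4) + ((aH*bL + aL*bH) << 2) + aL*bL
    """
    aH, aL = a >> 2, a & 0b11
    bH, bL = b >> 2, b & 0b11
    return (mul2(aH, bH) << 4) + ((mul2(aH, bL) + mul2(aL, bH)) << 2) + mul2(aL, bL)
```

The same pattern repeats one level up: an 8×8 multiplier is assembled from four such 4×4 units, which is the recursive architecture the abstract refers to.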
IEEE Transactions on Computers, vol. 74, no. 11, pp. 3888-3896.
Citations: 0
NetKG: Synthesizing Interpretable Network Router Configurations With Knowledge Graph
IF 3.8 CAS Tier 2, Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-09-01 DOI: 10.1109/TC.2025.3603712
Zhenbei Guo;Fuliang Li;Peng Zhang;Xingwei Wang;Jiannong Cao
Advanced router configuration synthesizers aim to prevent network outages by automatically synthesizing configurations that implement routing protocols. However, the lack of interpretability makes operators uncertain about how low-level configurations are synthesized and whether the automatically generated configurations correctly align with routing intents. This limitation restricts the practical deployment of synthesizers. In this paper, we present NetKG, an interpretable configuration synthesis tool. (i) NetKG leverages a knowledge graph as the intermediate representation for configurations, reformulating the configuration synthesis problem as a configuration knowledge completion task; (ii) NetKG regards network intents as query tasks that need to be satisfied in the current configuration space, achieving this through knowledge reasoning and completion; (iii) NetKG explains the synthesis process and the consistency between configuration and intent through the configuration knowledge involved in reasoning and completion. We show that NetKG can scale to realistic networks and automatically synthesize intent-compliant configurations for static routes, OSPF, and BGP. It can explain the consistency between configuration and intent at different granularities through a visual interface. Experimental results indicate that NetKG synthesizes configurations in 2 minutes for a network with up to 197 routers, which is 7.37x faster than the SMT-based synthesizer.
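Representing configurations as a knowledge graph means storing facts as (subject, predicate, object) triples and answering intents as queries over them. The toy store below illustrates that representation only; the predicate vocabulary ("runs", "advertises") is invented for the example, and NetKG's actual schema, reasoning, and completion machinery are far richer.

```python
class ConfigKG:
    """Toy triple store for router-configuration facts, queried by pattern.

    Hypothetical example vocabulary: ('r1', 'runs', 'ospf'),
    ('r1', 'advertises', '10.0.0.0/24'), etc.
    """
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        """Return all triples matching the given pattern; None is a wildcard."""
        return [(ts, tp, to) for ts, tp, to in self.triples
                if (s is None or ts == s)
                and (p is None or tp == p)
                and (o is None or to == o)]
```

An intent such as "which routers run OSPF?" then becomes the pattern query `kg.query(p="runs", o="ospf")`, and every answer is traceable to the stored configuration facts, which is the interpretability angle the abstract emphasizes.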
IEEE Transactions on Computers, vol. 74, no. 11, pp. 3722-3735.
Citations: 0
Concurrent Linguistic Error Detection (CLED): A New Methodology for Error Detection in Large Language Models
IF 3.8 CAS Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2025-09-01 DOI: 10.1109/TC.2025.3603682
Jinhua Zhu;Javier Conde;Zhen Gao;Pedro Reviriego;Shanshan Liu;Fabrizio Lombardi
The utilization of Large Language Models (LLMs) requires dependable operation in the presence of hardware errors (caused, for example, by radiation), which has become a pressing concern. At the same time, the scale and complexity of LLMs limit the overhead that can be added to detect errors. Therefore, there is a need for low-cost error detection schemes. Concurrent Error Detection (CED) uses the properties of a system to detect errors, so it is an appealing approach. In this paper, we present a new methodology and scheme for error detection in LLMs: Concurrent Linguistic Error Detection (CLED). Its main principle is that an LLM's output should be valid, coherent text; therefore, when the text is not valid or differs significantly from normal text, an error is likely. Hence, errors can potentially be detected by checking the linguistic features of the text generated by the LLM. This has two main advantages: 1) low overhead, as the checks are simple, and 2) general applicability regardless of the LLM implementation details, because text correctness is not tied to the LLM's algorithms or implementation. The proposed CLED has been evaluated on two LLMs: T5 and OPUS-MT. The results show that with a 1% overhead, CLED can detect more than 87% of the errors, making it suitable for improving LLM dependability at low cost.
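The core principle — flag generated text whose linguistic features deviate from normal text — can be sketched with a deliberately crude check. The vocabulary, the feature (out-of-vocabulary rate), and the threshold below are illustrative assumptions on our part; the paper's detector uses richer linguistic features than this.

```python
# Minimal sketch of concurrent linguistic error detection: score how many
# tokens fall outside a reference vocabulary and flag outputs that exceed a
# threshold. Vocabulary and threshold are illustrative, not from the paper.
VOCAB = {"the", "cat", "sat", "on", "mat", "a", "dog", "ran"}

def oov_rate(text):
    """Fraction of tokens not found in the reference vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    return sum(t not in VOCAB for t in tokens) / len(tokens)

def flag_error(text, threshold=0.3):
    """Flag the output as likely corrupted when too many tokens are invalid."""
    return oov_rate(text) > threshold

print(flag_error("the cat sat on a mat"))         # coherent output: not flagged
print(flag_error("the cxt szt qq mzt glorp xx"))  # bit-flip-style garbling: flagged
```

The check runs concurrently with normal decoding and touches only the produced text, which is why its overhead is small and independent of the LLM's internals.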
{"title":"Concurrent Linguistic Error Detection (CLED): A New Methodology for Error Detection in Large Language Models","authors":"Jinhua Zhu;Javier Conde;Zhen Gao;Pedro Reviriego;Shanshan Liu;Fabrizio Lombardi","doi":"10.1109/TC.2025.3603682","DOIUrl":"https://doi.org/10.1109/TC.2025.3603682","url":null,"abstract":"The utilization of Large Language Models (LLMs) requires dependable operation in the presence of errors in the hardware (caused by for example radiation) as this has become a pressing concern. At the same time, the scale and complexity of LLMs limit the overhead that can be added to detect errors. Therefore, there is a need for low-cost error detection schemes. Concurrent Error Detection (CED) uses the properties of a system to detect errors, so it is an appealing approach. In this paper, we present a new methodology and scheme for error detection in LLMs: Concurrent Linguistic Error Detection (CLED). Its main principle is that the output of LLMs should be valid and generate coherent text; therefore, when the text is not valid or differs significantly from the normal text, it is likely that there is an error. Hence, errors can potentially be detected by checking the linguistic features of the text generated by LLMs. This has the following main advantages: 1) low overhead as the checks are simple and 2) general applicability, so regardless of the LLM implementation details because the text correctness is not related to the LLM algorithms or implementations. The proposed CLED has been evaluated on two LLMs: T5 and OPUS-MT. The results show that with a 1% overhead, CLED can detect more than 87% of the errors, making it suitable to improve LLM dependability at low cost.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3638-3651"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
BaDFL: Mitigating Model Poisoning in Decentralized Federated Learning
IF 3.8 CAS Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2025-09-01 DOI: 10.1109/TC.2025.3603683
Yuan Yuan;Anhao Zhou;Xiao Zhang;Yifei Zou;Yangguang Shi;Dongxiao Yu
Decentralized federated learning (DFL) has gained significant attention due to its ability to facilitate collaborative model training without relying on a central server. However, it is highly vulnerable to backdoor attacks, where malicious participants can manipulate model updates to embed hidden functionalities. In this paper, we propose BaDFL, a novel Backdoor Attack defense mechanism for Decentralized Federated Learning. BaDFL enhances robustness by applying strategic model clipping at the local update level. To the best of our knowledge, BaDFL is the first decentralized federated learning algorithm with theoretical guarantees against model poisoning attacks. Specifically, BaDFL achieves an asymptotically optimal convergence rate of $O(\frac{1}{\sqrt{nT}})$, where $n$ is the number of nodes and $T$ is the maximum communication round number. Furthermore, we provide a comprehensive analysis under two different attack scenarios, showing that BaDFL maintains robustness within a specific defense radius. Extensive experimental results show that, on average, BaDFL can effectively defend against model poisoning within 8 mitigation rounds, with about a 1% drop in accuracy.
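The local-update clipping idea can be sketched as follows; the norm bound, the plain averaging rule, and the function names are illustrative assumptions on our part, not BaDFL's exact aggregation rule. Clipping caps how far any single (possibly poisoned) neighbor update can pull the aggregate.

```python
import math

def clip(update, bound):
    """Scale an update down so its L2 norm is at most `bound`."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= bound:
        return list(update)
    scale = bound / norm
    return [x * scale for x in update]

def aggregate(updates, bound=1.0):
    """Average neighbor updates after clipping each one (toy defense step)."""
    clipped = [clip(u, bound) for u in updates]
    dim = len(clipped[0])
    return [sum(u[i] for u in clipped) / len(clipped) for i in range(dim)]

honest = [[0.1, -0.2], [0.05, -0.1]]
poisoned = [[100.0, 100.0]]  # attacker tries to dominate the average
agg = aggregate(honest + poisoned)
print(agg)  # the poisoned update contributes at most a unit-norm vector
```

Without clipping, the poisoned coordinate above would drag the average past 30; with it, every participant's influence is bounded by the same norm budget.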
{"title":"BaDFL: Mitigating Model Poisoning in Decentralized Federated Learning","authors":"Yuan Yuan;Anhao Zhou;Xiao Zhang;Yifei Zou;Yangguang Shi;Dongxiao Yu","doi":"10.1109/TC.2025.3603683","DOIUrl":"https://doi.org/10.1109/TC.2025.3603683","url":null,"abstract":"Decentralized federated learning (DFL) has gained significant attention due to its ability to facilitate collaborative model training without relying on a central server. However, it is highly vulnerable to backdoor attacks, where malicious participants can manipulate model updates to embed hidden functionalities. In this paper, we propose BaDFL, a novel Backdoor Attack defense mechanism for Decentralized Federated Learning. BaDFL enhances robustness by applying strategic model clipping at the local update level. To the best of our knowledge, BaDFL is the first decentralized federated learning algorithm with theoretical guarantees against model poisoning attacks. Specifically, BaDFL achieves an asymptotically optimal convergence rate of <inline-formula><tex-math>$O(\frac{1}{\sqrt{nT}})$</tex-math></inline-formula>, where <inline-formula><tex-math>$n$</tex-math></inline-formula> is the number of nodes and <inline-formula><tex-math>$T$</tex-math></inline-formula> is the maximum communication round number. Furthermore, we provide a comprehensive analysis under two different attack scenarios, showing that BaDFL maintains robustness within a specific defense radius. Extensive experimental results show that, on average, BaDFL can effectively defend against model poisoning within 8 mitigation rounds, with about a 1% drop in accuracy.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"3968-3979"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-Up Cluster Design With High Bandwidth Main Memory Link
IF 3.8 CAS Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2025-09-01 DOI: 10.1109/TC.2025.3603692
Yichao Zhang;Marco Bertuletti;Chi Zhang;Samuel Riedel;Diyou Shen;Bowen Wang;Alessandro Vanelli-Coralli;Luca Benini
Shared L1-memory clusters of streamlined instruction processors (processing elements - PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g. GP-GPUs). Scaling out these architectures by increasing the number of clusters incurs computational and power overhead, caused by the requirement to split and merge large data structures in chunks and move chunks across memory hierarchies via the high-latency global interconnect. Scaling up the cluster reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected cores-to-L1-memory crossbar grows quadratically with Processing Element (PE) count, posing a major physical implementation challenge. We present TeraPool, a physically implementable, ${\boldsymbol >} 1000$ floating-point-capable RISC-V PEs scaled-up cluster design, sharing a Multi-MegaByte ${\boldsymbol >} 4000$-banked L1 memory via a low-latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12 nm FinFET technology, TeraPool achieves near-gigahertz frequencies (910 MHz typical, at 0.80 V/25 $^{\boldsymbol{\circ}}$C). The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5 pJ per memory bank access, just 0.74-1.1${\boldsymbol \times}$ the cost of an FP32 FMA. A high-bandwidth main memory link is designed to manage data transfers in/out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910 MHz, the cluster delivers up to 1.89 single-precision TFLOP/s peak performance and up to 200 GFLOP/s/W energy efficiency (at a high average IPC/PE of 0.8) in benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in the literature.
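As a sanity check on the headline number, a back-of-envelope calculation lands just under the quoted 1.89 TFLOP/s peak. The assumption that each PE retires one FMA (two FLOPs) per cycle is ours, not stated in the abstract.

```python
# Back-of-envelope check of TeraPool's peak throughput, assuming each of the
# 1024 PEs retires one FMA (2 FLOPs) per cycle at the 910 MHz typical clock.
pes = 1024
flops_per_cycle_per_pe = 2  # one fused multiply-add per cycle (assumed)
freq_hz = 910e6

peak_flops = pes * flops_per_cycle_per_pe * freq_hz
print(f"peak ~ {peak_flops / 1e12:.2f} TFLOP/s")  # just under the quoted 1.89
```

The small gap to the quoted figure suggests the paper's peak accounts for slightly more than one FMA issue per PE per cycle, or a marginally different operating point.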
{"title":"TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-Up Cluster Design With High Bandwidth Main Memory Link","authors":"Yichao Zhang;Marco Bertuletti;Chi Zhang;Samuel Riedel;Diyou Shen;Bowen Wang;Alessandro Vanelli-Coralli;Luca Benini","doi":"10.1109/TC.2025.3603692","DOIUrl":"https://doi.org/10.1109/TC.2025.3603692","url":null,"abstract":"Shared L1-memory clusters of streamlined instruction processors (processing elements - PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g. GP-GPUs). <i>Scaling out</i> these architectures by increasing the number of clusters incurs computational and power overhead, caused by the requirement to split and merge large data structures in chunks and move chunks across memory hierarchies via the high-latency global interconnect. <i>Scaling up</i> the cluster reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected cores-to-L1-memory crossbar grows quadratically with Processing Element (PE)-count, posing a major physical implementation challenge. We present TeraPool, a physically implementable, <inline-formula><tex-math>${\boldsymbol &gt;} 1000$</tex-math></inline-formula> floating-point-capable RISC-V PEs scaled-up cluster design, sharing a Multi-MegaByte <inline-formula><tex-math>${\boldsymbol &gt;} 4000$</tex-math></inline-formula>-banked L1 memory via a low latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12 nm FinFET technology, TeraPool achieves near-gigahertz frequencies (910 MHz) typical, 0.80 V/25 <inline-formula><tex-math>$^{\boldsymbol{\circ}}$</tex-math></inline-formula>C. The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5 pJ for memory bank accesses, just 0.74-1.1<inline-formula><tex-math>${\boldsymbol \times}$</tex-math></inline-formula> the cost of a FP32 FMA. A high-bandwidth main memory link is designed to manage data transfers in/out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910 MHz, the cluster delivers up to 1.89 single precision TFLOP/s peak performance and up to 200 GFLOP/s/W energy efficiency (at a high IPC/PE of 0.8 on average) in benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in literature.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3667-3681"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0