Accurate and fast data stream mining is critical to many tasks, including real-time analysis of mobile sensor data, big data management, and machine learning. Various heavy item-oriented detection tasks, such as identifying heavy hitters, heavy changers, persistent items, and significant items, have garnered considerable attention from both industry and academia. Unfortunately, as data stream speeds continue to increase and the memory available for real-time processing, particularly in the L1 cache, remains limited, existing schemes struggle to simultaneously achieve high detection accuracy, memory efficiency, and fast update throughput, as we reveal. To tackle this conundrum, we propose a versatile and elegant sketch framework named Tight-Sketch, which supports a spectrum of heavy-based detection tasks. Recognizing that, in practice, most items are cold (non-heavy/persistent/significant), we implement distinct eviction strategies for different item types. This approach allows us to swiftly discard potentially cold items while offering enhanced protection to hot ones (heavy/persistent/significant). Additionally, we introduce an eviction method based on stochastic decay, ensuring that Tight-Sketch incurs only small one-sided errors without overestimation. To further enhance detection accuracy under extremely constrained memory allocations, we introduce Tight-Opt, a variant incorporating two optimization strategies. We conduct extensive experiments across various detection tasks to demonstrate that Tight-Sketch significantly outperforms existing methods in terms of both accuracy and update speed. Furthermore, by utilizing Single Instruction Multiple Data (SIMD) instructions, we enhance Tight-Sketch's update throughput by up to 36%. We also implement Tight-Sketch on FPGA to validate its practicality and low resource overhead in hardware deployments.
{"title":"Efficient Sketching for Heavy Item-Oriented Data Stream Mining With Memory Constraints","authors":"Weihe Li;Paul Patras","doi":"10.1109/TC.2025.3604467","DOIUrl":"https://doi.org/10.1109/TC.2025.3604467","url":null,"abstract":"Accurate and fast data stream mining is critical to many tasks, including real-time series analysis for mobile sensor data, big data management and machine learning. Various heavy-oriented item detection tasks, such as identifying heavy hitters, heavy changers, persistent items, and significant items, have garnered considerable attention from both industry and academia. Unfortunately, as data stream speeds continue to increase and the available memory, particularly in L1 cache, remains limited for real-time processing, existing schemes face challenges in simultaneously achieving high detection accuracy, memory efficiency, and fast update throughput, as we reveal. To tackle this conundrum, we propose a versatile and elegant sketch framework named Tight-Sketch, which supports a spectrum of heavy-based detection tasks. Recognizing that, in practice, most items are cold (non-heavy/persistent/significant), we implement distinct eviction strategies for different item types. This approach allows us to swiftly discard potentially cold items while offering enhanced protection to hot ones (heavy/persistent/significant). Additionally, we introduce an eviction method based on stochastic decay, ensuring that Tight-Sketch incurs only small one-sided errors without overestimation. To further enhance detection accuracy under extremely constrained memory allocations, we introduce Tight-Opt, a variant incorporating two optimization strategies. We conduct extensive experiments across various detection tasks to demonstrate that Tight-Sketch significantly outperforms existing methods in terms of both accuracy and update speed. Furthermore, by utilizing Single Instruction Multiple Data (SIMD) instructions, we enhance Tight-Sketch’s update throughput by up to 36%. We also implement Tight-Sketch on FPGA to validate its practicality and low resource overhead in hardware deployments.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3845-3859"},"PeriodicalIF":3.8,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient resource utilization is crucial in real-world applications, especially for balancing loads across machines that handle specific job types. This paper introduces a novel batch-ordered job-store scheduling model, where jobs in a batch are scheduled sequentially, with their operations allocated in a round-robin fashion, under two scenarios. We establish that this problem is NP-hard and analyze it in both online and offline settings. In the online case, we first examine the exclusive scenario, where operations within the same job must be scheduled on different machines, and show that a load greedy (LG) algorithm achieves a tight competitive ratio of $2-\frac{1}{m}$, with $m$ representing the number of machines. Next, we consider the circular scenario, which requires maintaining the circular order of operations across ordered machines. In this context, we analyze potential anomalies in load distribution during the local optimality achieved by the ordered load greedy (OLG) algorithm and provide bounds on the occurrence of these anomalies and on the maximum load in each local scheduling round. In the offline case, we abstract each OLG scheduling process as a generalized circular sequence alignment (CSA) problem and develop a dynamic programming-based matching (DPM) algorithm to solve it. To further enhance load balancing, we develop a dynamic programming-based optimization (DPO) algorithm that schedules multiple jobs simultaneously in both scenarios. Experimental results confirm the efficiency of DPM for the CSA problem, and we validate the load balancing effectiveness of both the online and offline algorithms using real traffic datasets. These theoretical findings and algorithmic implementations lay a solid groundwork for future practical advancements.
{"title":"Load Balancing Scheduling for Batch-Ordered Job-Store: Online vs. Offline","authors":"Mengbing Zhou;Yang Wang;Bocong Zhao;Chengzhong Xu","doi":"10.1109/TC.2025.3603725","DOIUrl":"https://doi.org/10.1109/TC.2025.3603725","url":null,"abstract":"Efficient resource utilization is crucial in real-world applications, especially for balancing loads across machines handling specific job types. This paper introduces a novel batch-ordered job-store scheduling model, where jobs in a batch are scheduled sequentially, with their operations allocated in a round-robin fashion across two scenarios. We establish that this problem is NP-hard and analyze it in both online and offline settings. In the online case, we first examine the exclusive scenario, where operations within the same job must be scheduled on different machines, and show that a load greedy (LG) algorithm achieves a tight competitive ratio of <inline-formula><tex-math>$2-frac{1}{m}$</tex-math></inline-formula>, with <inline-formula><tex-math>$m$</tex-math></inline-formula> representing the number of machines. Next, we consider the circular scenario, which requires maintaining the circular order of operations across ordered machines. In this context, we analyze potential anomalies in load distribution during local optimality achieved by the ordered load greedy (OLG) algorithm and provide bounds on the occurrence of these anomalies and the maximum load in each local scheduling round. In the offline case, we abstract each OLG scheduling process as a generalized circular sequence alignment (CSA) problem and develop a dynamic programming-based matching (DPM) algorithm to solve it. To further enhance load balancing, we develop a dynamic programming-based optimization (DPO) algorithm to schedule multiple jobs simultaneously in both scenarios. Experimental results confirm the efficiency of DPM for the CSA problem, and we validate the load balancing effectiveness of both online and offline algorithms using real traffic datasets. These theoretical findings and algorithmic implementations lay a solid groundwork for future practical advancements.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3778-3791"},"PeriodicalIF":3.8,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel execution and conditional execution are increasingly prevalent in modern embedded systems. In real-time scheduling, a fundamental problem is how to upper-bound the response time of a task. Recent work applied the multi-path technique to reduce the response time bound for tasks with parallel execution, but left tasks with conditional execution as an open problem. This paper focuses on upper-bounding the response times of tasks with both parallel execution and conditional execution using the multi-path technique. By designing a delicate abstraction of the multiple paths across the various conditional branches, we derive a new response time bound. We further apply this response time bound to the scheduling of multiple parallel tasks with conditional branches. Experiments demonstrate that the proposed bound significantly advances the state of the art, reducing the response time bound by 9.4% and improving schedulability by 31.2% on average.
{"title":"Multi-Path Bound for Parallel Tasks With Conditional Branches","authors":"Qingqiang He;Nan Guan;Zhe Jiang;Mingsong Lv","doi":"10.1109/TC.2025.3604469","DOIUrl":"https://doi.org/10.1109/TC.2025.3604469","url":null,"abstract":"Parallel execution and conditional execution are increasingly prevalent in modern embedded systems. In real-time scheduling, a fundamental problem is how to upper-bound the response times of a task. Recent work applied the multi-path technique to reduce the response time bound for tasks with parallel execution, but left tasks with conditional execution as an open problem. This paper focuses on upper-bounding response times for tasks with both parallel execution and conditional execution using the multi-path technique. By designing a delicate abstraction regarding the multiple paths of various conditional branches, we derive a new response time bound. We further apply this response time bound into the scheduling of multiple parallel tasks with conditional branches. Experiments demonstrate that the proposed bound significantly advances the state-of-the-art, reducing the response time bound by 9.4% and improving the schedulability by 31.2% on average.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3873-3887"},"PeriodicalIF":3.8,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing popularity of geo-positioning technologies and the mobile Internet, spatial data query services have attracted extensive attention. To protect the confidentiality of sensitive information outsourced to cloud servers, much effort has been devoted to designing geometric range query schemes over encrypted spatial data without affecting availability. However, existing works focus on privacy-preserving schemes built on traditional tree indexes, which incur high computation and storage costs. In this paper, we propose an efficient conjunctive geometric range query scheme over encrypted spatial data with a learned index. In particular, we design a new privacy-preserving learned index for spatial data to reduce the search space and storage overhead. The main idea is to add noise disturbance to the objective function instead of adding it directly to the output results, which reduces the leakage of private information while ensuring the correctness of the output results. Moreover, we propose a spatial segmentation algorithm to avoid accessing a large number of unnecessary Z codes during the query process. A formal security analysis shows that our scheme ensures index data security and query privacy. Simulation results show that query efficiency is improved while storage overhead is significantly reduced compared with state-of-the-art schemes.
{"title":"Efficient Conjunctive Geometric Range Query Over Encrypted Spatial Data With Learned Index","authors":"Mingyue Li;Chunfu Jia;Ruizhong Du;Guanxiong Ha","doi":"10.1109/TC.2025.3604470","DOIUrl":"https://doi.org/10.1109/TC.2025.3604470","url":null,"abstract":"With the increasing popularity of geo-positioning technologies and mobile Internet, spatial data query services have attracted extensive attention. To protect the confidentiality of sensitive information outsourced to cloud servers, much efforts have been devoted to designing geometric range query schemes over encrypted spatial data without affecting availability. However, existing works focus on the privacy-preserving schemes with traditional tree indexes, causing more computing and storage issues. In this paper, we propose an efficient conjunctive geometric range query scheme over encrypted spatial data with a learned index. In particular, we design a new privacy-preserving learned index for spatial data to reduce the search space and storage overhead. The main idea is to add noise disturbance to the objective function instead of directly adding it to output results, reducing the leakage of private information and ensuring the correctness of output results. Moreover, we propose a spatial segmentation algorithm to avoid accessing a large number of unnecessary Z codes in the query process. The formal security analysis shows that our scheme ensures index data security and query privacy. Simulation results show that the query efficiency is improved while the storage overhead is significantly reduced compared with the state-of-the-art schemes.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"3995-4009"},"PeriodicalIF":3.8,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145456008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As multiprocessor systems scale up, $h$-extra connectivity and $h$-extra diagnosability serve as two pivotal metrics for assessing the reliability of the underlying interconnection networks. To ensure that each component of the survival graph holds no fewer than $h+1$ vertices, the $h$-extra connectivity and $h$-extra diagnosability have been proposed to characterize the fault tolerability and self-diagnosing capability of networks, respectively. Many efforts have been made to establish a quantifiable relationship between these metrics, but existing results are less than optimal. This work addresses the flaws of the existing results and proposes a novel proof that determines the metric relationship between $h$-extra connectivity and $h$-extra diagnosability under the PMC and MM* models. Our approach overcomes the defects of previous results by abandoning reliance on the network's regularity and independence number. Furthermore, we apply the suggested metric to establish the $h$-extra diagnosability of a new network class, named the generalized exchanged X-cube-like network $GEXC(s,t)$, which takes the dual-cube-like network, generalized exchanged hypercube, generalized exchanged crossed cube, and locally generalized exchanged twisted cube as special cases. Finally, we propose the $h$-extra diagnosis strategy ($h$-EDS), design two self-diagnosis algorithms, AhED-PMC and AhED-MM*, and conduct experiments on $GEXC(s,t)$ and the real-world network DD-$g648$ to show the high accuracy and superior performance of the proposed algorithms.
{"title":"The Metric Relationship Between Extra Connectivity and Extra Diagnosability of Multiprocessor Systems","authors":"Yifan Li;Shuming Zhou;Sun-Yuan Hsieh;Qifan Zhang","doi":"10.1109/TC.2025.3604468","DOIUrl":"https://doi.org/10.1109/TC.2025.3604468","url":null,"abstract":"As multiprocessor systems scale up, <inline-formula><tex-math>$h$</tex-math></inline-formula>-extra connectivity and <inline-formula><tex-math>$h$</tex-math></inline-formula>-extra diagnosability serve as two pivotal metrics for assessing the reliability of the underlying interconnection networks. To ensure that each component of the survival graph holds no fewer than <inline-formula><tex-math>$h + 1$</tex-math></inline-formula> vertices, the <inline-formula><tex-math>$h$</tex-math></inline-formula>-extra connectivity and <inline-formula><tex-math>$h$</tex-math></inline-formula>-extra diagnosability have been proposed to characterize the fault tolerability and self-diagnosing capability of networks, respectively. Many efforts have been made to establish the quantifiable relationship between these metrics but it is less than optimal. This work addresses the flaws of the existing results and proposes a novel proof to determine the metric relationship between <inline-formula><tex-math>$h$</tex-math></inline-formula>-extra connectivity and <inline-formula><tex-math>$h$</tex-math></inline-formula>-extra diagnosability under the PMC and MM<sup>*</sup> models. Our approach overcomes the defect of previous results by abandoning the network’s regularity and independence number. Furthermore, we apply the suggested metric to establish the <inline-formula><tex-math>$h$</tex-math></inline-formula>-extra diagnosability of a new network class, named generalized exchanged X-cube-like network <inline-formula><tex-math>$GEXC(s,t)$</tex-math></inline-formula>, which takes dual-cube-like network, generalized exchanged hypercube, generalized exchanged crossed cube, and locally generalized exchanged twisted cube as special cases. Finally, we propose the <inline-formula><tex-math>$h$</tex-math></inline-formula>-extra diagnosis strategy (<inline-formula><tex-math>$h$</tex-math></inline-formula>-EDS) and design two self-diagnosis algorithms AhED-PMC and AhED-MM<sup>*</sup>, and then conduct experiments on <inline-formula><tex-math>$GEXC(s,t)$</tex-math></inline-formula> and the real-world network DD-<inline-formula><tex-math>$g648$</tex-math></inline-formula> to show the high accuracy and superior performance of the proposed algorithms.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3860-3872"},"PeriodicalIF":3.8,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multipliers, particularly those with small bit widths, are essential for modern neural network (NN) applications. In addition, multiple-precision multipliers are in high demand for efficient NN accelerators; therefore, recursive multipliers used in low-precision fusion schemes are gaining increasing attention. In this work, we design exact recursive multipliers based on customized approximate full adders (AFAs) for low-power purposes. Initially, the partial products (PPs) encoded by $2\times 2$ multiplications are analyzed, which reveals the correlations among adjacent PPs. Based on these correlations, we propose $4\times 4$ recursive multiplier architectures in which certain full adders (FAs) can be simplified without affecting the correctness of the multiplication. Manual and synthesis-tool-based FA simplifications are performed separately. The obtained $4\times 4$ multipliers are then used to construct $8\times 8$ multipliers based on a low-power recursive architecture. Finally, the proposed signed and unsigned $4\times 4$ and $8\times 8$ multipliers are evaluated using a 28 nm CMOS technology. Compared with DesignWare (DW) multipliers, the proposed signed and unsigned $4\times 4$ multipliers achieve power reductions of 16.5% and 11.6%, respectively, without compromising area or delay; alternatively, their delay can be reduced by 20.9% and 39.4%, respectively, without compromising power or area. For the signed and unsigned $8\times 8$ multipliers, the maximum power reductions are 9.7% and 13.7%, respectively, albeit with a trade-off in area.
{"title":"Low-Power Multiplier Designs by Leveraging Correlations of 2$times$×2 Encoded Partial Products","authors":"Ao Liu;Siting Liu;Hui Wang;Qin Wang;Fabrizio Lombardi;Zhigang Mao;Honglan Jiang","doi":"10.1109/TC.2025.3604478","DOIUrl":"https://doi.org/10.1109/TC.2025.3604478","url":null,"abstract":"Multipliers, particularly those with small bit widths, are essential for modern neural network (NN) applications. In addition, multiple-precision multipliers are in high demand for efficient NN accelerators; therefore, recursive multipliers used in low-precision fusion schemes are gaining increasing attention. In this work, we design exact recursive multipliers based on customized approximate full adders (AFAs) for low-power purposes. Initially, the partial products (PPs) encoded by 2<inline-formula><tex-math>$times$</tex-math></inline-formula>2 multiplications are analyzed, which reveals the correlations among adjacent PPs. Based on these correlations, we propose 4<inline-formula><tex-math>$times$</tex-math></inline-formula>4 recursive multiplier architectures where certain full adders (FAs) can be simplified without affecting the correctness of the multiplication. Manually and synthesis tool-based FA simplifications are performed separately. The obtained 4<inline-formula><tex-math>$times$</tex-math></inline-formula>4 multipliers are then used to construct 8<inline-formula><tex-math>$times$</tex-math></inline-formula>8 multipliers based on a low-power recursive architecture. Finally, the proposed signed and unsigned 4<inline-formula><tex-math>$times$</tex-math></inline-formula>4 and 8<inline-formula><tex-math>$times$</tex-math></inline-formula>8 multipliers are evaluated using a 28nm CMOS technology. Compared with DesignWare (DW) multipliers, the proposed signed and unsigned 4<inline-formula><tex-math>$times$</tex-math></inline-formula>4 multipliers achieve power reductions of 16.5% and 11.6%, respectively, without compromising area or delay; alternatively, the delay can be reduced by 20.9% and 39.4%, respectively, without compromising power or area. For signed and unsigned 8<inline-formula><tex-math>$times$</tex-math></inline-formula>8 multipliers, the maximum power reductions are 9.7% and 13.7%, respectively, albeit with a trade-off in area.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3888-3896"},"PeriodicalIF":3.8,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advanced router configuration synthesizers aim to prevent network outages by automatically synthesizing configurations that implement routing protocols. However, the lack of interpretability leaves operators uncertain about how low-level configurations are synthesized and whether the automatically generated configurations correctly align with routing intents. This limitation restricts the practical deployment of synthesizers. In this paper, we present NetKG, an interpretable configuration synthesis tool. (i) NetKG leverages a knowledge graph as the intermediate representation for configurations, reformulating the configuration synthesis problem as a configuration knowledge completion task; (ii) NetKG regards network intents as query tasks that must be satisfied in the current configuration space, achieving this through knowledge reasoning and completion; (iii) NetKG explains the synthesis process and the consistency between configuration and intent through the configuration knowledge involved in reasoning and completion. We show that NetKG scales to realistic networks and automatically synthesizes intent-compliant configurations for static routes, OSPF, and BGP. It can explain the consistency between configuration and intent at different granularities through a visual interface. Experimental results indicate that NetKG synthesizes configurations within 2 minutes for a network with up to 197 routers, 7.37× faster than the SMT-based synthesizer.
{"title":"NetKG: Synthesizing Interpretable Network Router Configurations With Knowledge Graph","authors":"Zhenbei Guo;Fuliang Li;Peng Zhang;Xingwei Wang;Jiannong Cao","doi":"10.1109/TC.2025.3603712","DOIUrl":"https://doi.org/10.1109/TC.2025.3603712","url":null,"abstract":"Advanced router configuration synthesizers aim to prevent network outages by automatically synthesizing configurations that implement routing protocols. However, the lack of interpretability makes operators uncertain about how low-level configurations are synthesized and whether the automatically generated configurations correctly align with routing intents. This limitation restricts the practical deployment of synthesizers. In this paper, we present NetKG, an interpretable configuration synthesis tool. <inline-formula><tex-math>$(i)$</tex-math></inline-formula> NetKG leverages a knowledge graph as the intermediate representation for configurations, reformulating the configuration synthesis problem as a configuration knowledge completion task; <inline-formula><tex-math>$(ii)$</tex-math></inline-formula> NetKG regards network intents as query tasks that need to be satisfied in the current configuration space, achieving this through knowledge reasoning and completion; <inline-formula><tex-math>$(iii)$</tex-math></inline-formula> NetKG explains the synthesis process and the consistency between configuration and intent through the configuration knowledge involved in reasoning and completion. We show that NetKG can scale to realistic networks and automatically synthesize intent-compliant configurations for static routes, OSPF, and BGP. It can explain the consistency between configuration and intent at different granularities through a visual interface. Experimental results indicate that NetKG synthesizes configurations in 2 minutes for a network with up to 197 routers, which is 7.37x faster than the SMT-based synthesizer.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3722-3735"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large Language Models (LLMs) must operate dependably in the presence of hardware errors (caused, for example, by radiation), which has become a pressing concern. At the same time, the scale and complexity of LLMs limit the overhead that can be added to detect errors. Therefore, there is a need for low-cost error detection schemes. Concurrent Error Detection (CED) uses the properties of a system to detect errors, making it an appealing approach. In this paper, we present a new methodology and scheme for error detection in LLMs: Concurrent Linguistic Error Detection (CLED). Its main principle is that an LLM should generate valid, coherent text; therefore, when the output text is not valid or differs significantly from normal text, an error is likely. Hence, errors can potentially be detected by checking the linguistic features of the text generated by LLMs. This has two main advantages: 1) low overhead, as the checks are simple, and 2) general applicability regardless of LLM implementation details, because text correctness is not tied to the LLM's algorithms or implementations. The proposed CLED has been evaluated on two LLMs: T5 and OPUS-MT. The results show that with a 1% overhead, CLED can detect more than 87% of the errors, making it suitable for improving LLM dependability at low cost.
{"title":"Concurrent Linguistic Error Detection (CLED): A New Methodology for Error Detection in Large Language Models","authors":"Jinhua Zhu;Javier Conde;Zhen Gao;Pedro Reviriego;Shanshan Liu;Fabrizio Lombardi","doi":"10.1109/TC.2025.3603682","DOIUrl":"https://doi.org/10.1109/TC.2025.3603682","url":null,"abstract":"The utilization of Large Language Models (LLMs) requires dependable operation in the presence of errors in the hardware (caused by for example radiation) as this has become a pressing concern. At the same time, the scale and complexity of LLMs limit the overhead that can be added to detect errors. Therefore, there is a need for low-cost error detection schemes. Concurrent Error Detection (CED) uses the properties of a system to detect errors, so it is an appealing approach. In this paper, we present a new methodology and scheme for error detection in LLMs: Concurrent Linguistic Error Detection (CLED). Its main principle is that the output of LLMs should be valid and generate coherent text; therefore, when the text is not valid or differs significantly from the normal text, it is likely that there is an error. Hence, errors can potentially be detected by checking the linguistic features of the text generated by LLMs. This has the following main advantages: 1) low overhead as the checks are simple and 2) general applicability, so regardless of the LLM implementation details because the text correctness is not related to the LLM algorithms or implementations. The proposed CLED has been evaluated on two LLMs: T5 and OPUS-MT. The results show that with a 1% overhead, CLED can detect more than 87% of the errors, making it suitable to improve LLM dependability at low cost.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3638-3651"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Decentralized federated learning (DFL) has gained significant attention due to its ability to facilitate collaborative model training without relying on a central server. However, it is highly vulnerable to backdoor attacks, where malicious participants can manipulate model updates to embed hidden functionalities. In this paper, we propose BaDFL, a novel Backdoor Attack defense mechanism for Decentralized Federated Learning. BaDFL enhances robustness by applying strategic model clipping at the local update level. To the best of our knowledge, BaDFL is the first decentralized federated learning algorithm with theoretical guarantees against model poisoning attacks. Specifically, BaDFL achieves an asymptotically optimal convergence rate of $O(\frac{1}{\sqrt{nT}})$, where $n$ is the number of nodes and $T$ is the maximum communication round number. Furthermore, we provide a comprehensive analysis under two different attack scenarios, showing that BaDFL maintains robustness within a specific defense radius. Extensive experimental results show that, on average, BaDFL can effectively defend against model poisoning within 8 mitigation rounds, with about a 1% drop in accuracy.
{"title":"BaDFL: Mitigating Model Poisoning in Decentralized Federated Learning","authors":"Yuan Yuan;Anhao Zhou;Xiao Zhang;Yifei Zou;Yangguang Shi;Dongxiao Yu","doi":"10.1109/TC.2025.3603683","DOIUrl":"https://doi.org/10.1109/TC.2025.3603683","url":null,"abstract":"Decentralized federated learning (DFL) has gained significant attention due to its ability to facilitate collaborative model training without relying on a central server. However, it is highly vulnerable to backdoor attacks, where malicious participants can manipulate model updates to embed hidden functionalities. In this paper, we propose BaDFL, a novel Backdoor Attack defense mechanism for Decentralized Federated Learning. BaDFL enhances robustness by applying strategic model clipping at the local update level. To the best of our knowledge, BaDFL is the first decentralized federated learning algorithm with theoretical guarantees against model poisoning attacks. Specifically, BaDFL achieves an asymptotically optimal convergence rate of <inline-formula><tex-math>$O(frac{1}{sqrt{nT}})$</tex-math></inline-formula>, where <inline-formula><tex-math>$n$</tex-math></inline-formula> is the number of nodes and <inline-formula><tex-math>$T$</tex-math></inline-formula> is the maximum communication round number. Furthermore, we provide a comprehensive analysis under two different attack scenarios, showing that BaDFL maintains robustness within a specific defense radius. Extensive experimental results show that, on average, BaDFL can effectively defend against model poisoning within 8 mitigation rounds, with about a 1% drop in accuracy.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"3968-3979"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shared-L1-memory clusters of streamlined instruction processors (processing elements, PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g., GP-GPUs). Scaling out these architectures by increasing the number of clusters incurs computational and power overhead, caused by the need to split and merge large data structures into chunks and move those chunks across memory hierarchies via the high-latency global interconnect. Scaling up the cluster reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected cores-to-L1-memory crossbar grows quadratically with the PE count, posing a major physical implementation challenge. We present TeraPool, a physically implementable scaled-up cluster design with more than 1000 floating-point-capable RISC-V PEs, sharing a multi-megabyte L1 memory with more than 4000 banks via a low-latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12 nm FinFET technology, TeraPool achieves a near-gigahertz clock frequency (910 MHz) at typical conditions (0.80 V, 25 °C). The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5 pJ per memory bank access, just 0.74-1.1× the cost of an FP32 FMA. A high-bandwidth main memory link is designed to manage data transfers in and out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910 MHz, the cluster delivers up to 1.89 single-precision TFLOP/s peak performance and up to 200 GFLOP/s/W energy efficiency (at a high average IPC/PE of 0.8) on benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in the literature.
{"title":"TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-Up Cluster Design With High Bandwidth Main Memory Link","authors":"Yichao Zhang;Marco Bertuletti;Chi Zhang;Samuel Riedel;Diyou Shen;Bowen Wang;Alessandro Vanelli-Coralli;Luca Benini","doi":"10.1109/TC.2025.3603692","DOIUrl":"https://doi.org/10.1109/TC.2025.3603692","url":null,"abstract":"Shared L1-memory clusters of streamlined instruction processors (processing elements - PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g. GP-GPUs). <i>Scaling out</i> these architectures by increasing the number of clusters incurs computational and power overhead, caused by the requirement to split and merge large data structures in chunks and move chunks across memory hierarchies via the high-latency global interconnect. <i>Scaling up</i> the cluster reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected cores-to-L1-memory crossbar grows quadratically with Processing Element (PE)-count, posing a major physical implementation challenge. We present TeraPool, a physically implementable, <inline-formula><tex-math>${boldsymbol >} 1000$</tex-math></inline-formula> floating-point-capable RISC-V PEs scaled-up cluster design, sharing a Multi-MegaByte <inline-formula><tex-math>${boldsymbol >} 4000$</tex-math></inline-formula>-banked L1 memory via a low latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12 nm FinFET technology, TeraPool achieves near-gigahertz frequencies (910 MHz) typical, 0.80 V/25 <inline-formula><tex-math>$^{boldsymbol{circ}}$</tex-math></inline-formula>C. The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5 pJ for memory bank accesses, just 0.74-1.1<inline-formula><tex-math>${boldsymbol times}$</tex-math></inline-formula> the cost of a FP32 FMA. A high-bandwidth main memory link is designed to manage data transfers in/out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910 MHz, the cluster delivers up to 1.89 single precision TFLOP/s peak performance and up to 200 GFLOP/s/W energy efficiency (at a high IPC/PE of 0.8 on average) in benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in literature.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3667-3681"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}