Pub Date: 2026-06-01 | Epub Date: 2026-02-09 | DOI: 10.1016/j.sysarc.2026.103728
An architecture-adaptive optimization strategy for high-performance SYMV on a heterogeneous AI accelerator
Hao Jiang, Lu Lu, Zhihong Liang
Journal of Systems Architecture, vol. 175, Article 103728

Emerging AI accelerators offer strong compute density for HPC workloads, but decoupled execution engines and software-managed memory systems complicate performance portability. This paper studies the memory-bound SYmmetric Matrix–Vector multiplication (SYMV) kernel on Huawei Ascend A2, a heterogeneous architecture with disjoint Cube (AIC) and Vector (AIV) engines. We propose an architecture-adaptive mapping that (i) assigns off-diagonal dense tiles to AIC while keeping diagonal/finalization work on AIV, (ii) orchestrates cross-engine execution with a three-stage software pipeline to overlap DMA, compute, and synchronization, and (iii) reduces off-chip matrix-read traffic via symmetry-aware traversal under triangular storage, together with a transpose-free diagonal-tile strategy on AIV. On Ascend A2, the proposed kernel achieves a consistent 1.3×–1.6× speedup over the vendor matmul_gemv baseline, and we provide cross-platform context against cuBLAS (A100) and rocBLAS (MI210).
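The symmetry-aware traversal described above can be sketched at tile granularity. This is a minimal NumPy illustration of the traffic-saving idea only, not the Ascend kernel: each off-diagonal tile of the lower triangle is read once and contributes both its direct and its transposed product, roughly halving matrix reads.

```python
import numpy as np

def symv(A, x, tb=2):
    """SYMV reading only the lower triangle of A, tile by tile.

    Each off-diagonal tile T is loaded once and used for both its
    direct (y_i += T @ x_j) and transposed (y_j += T.T @ x_i)
    contributions, so the upper triangle is never read.
    """
    n = A.shape[0]
    assert n % tb == 0
    y = np.zeros(n)
    for i in range(0, n, tb):
        # Diagonal tile: symmetric, handled on its own.
        D = A[i:i+tb, i:i+tb]
        y[i:i+tb] += D @ x[i:i+tb]
        for j in range(0, i, tb):
            T = A[i:i+tb, j:j+tb]          # lower-triangle tile, read once
            y[i:i+tb] += T @ x[j:j+tb]     # direct contribution
            y[j:j+tb] += T.T @ x[i:i+tb]   # transposed contribution
    return y
```

On the accelerator the two products of each tile go to different engines (AIC vs. AIV); here they are simply two matrix products sharing one load.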
Pub Date: 2026-06-01 | Epub Date: 2026-02-13 | DOI: 10.1016/j.sysarc.2026.103740
DFTS-MCS: Dynamic fault-tolerant scheduling for mixed-criticality systems on heterogeneous multi-core processors
Mahin Moradiyan, Yasser Sedaghat
Journal of Systems Architecture, vol. 175, Article 103740

Mixed-criticality embedded systems (MCSs) support safety-critical applications by managing tasks with different criticality levels under strict timing constraints. Traditional scheduling prioritizes high-criticality tasks by suspending or degrading low-criticality ones during faults or overruns, often leading to inefficient resource use and reduced quality of service. Various heterogeneous platforms meet MCS needs, with ARM big.LITTLE offering a balanced mix of performance, energy efficiency, and real-time reliability. This paper introduces DFTS-MCS, a dynamic fault-tolerant scheduling method for ARM big.LITTLE platforms that addresses these challenges. The DFTS-MCS method comprises three phases: (1) reliability-driven task mapping; (2) adaptive task allocation; and (3) a dynamic fault-tolerant execution model. Results show that, compared to state-of-the-art methods, DFTS-MCS achieves the highest high-criticality task success rate (94.1%) and reduces missed deadlines by up to 40%. DFTS-MCS recovers tasks 1.3× more effectively than competing methods on average, with up to a 19% higher recovery rate over the weakest baseline. It also minimizes fault-induced delays (13.4 ms for HI tasks) and maintains low execution overhead (8.7% HI, 14.3% LO). It achieves superior load balancing by assigning up to 84% of critical computation to big cores. These results validate DFTS-MCS as a scalable and robust solution for real-time MCSs operating in fault-prone and resource-constrained environments.
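The criticality-aware mapping phase described above can be illustrated with a deliberately simplified sketch. This is a hypothetical greedy policy, not the paper's algorithm: high-criticality (HI) tasks fill big cores first, low-criticality (LO) tasks prefer LITTLE cores and spill to big cores only if there is slack.

```python
def map_tasks(tasks, big_capacity, little_capacity):
    """Illustrative criticality-driven mapping for a big.LITTLE pair.

    `tasks` is a list of dicts with keys "id", "crit" ("HI"/"LO"),
    and "util" (CPU utilization). HI tasks are placed first, heaviest
    first; LO tasks that fit nowhere are dropped (LO degradation).
    """
    big, little = [], []
    big_load = little_load = 0.0
    for t in sorted(tasks, key=lambda t: (t["crit"] != "HI", -t["util"])):
        if t["crit"] == "HI" and big_load + t["util"] <= big_capacity:
            big.append(t["id"]); big_load += t["util"]
        elif little_load + t["util"] <= little_capacity:
            little.append(t["id"]); little_load += t["util"]
        elif big_load + t["util"] <= big_capacity:
            big.append(t["id"]); big_load += t["util"]
        # else: task dropped, mirroring LO degradation under overload
    return big, little
```

The real method additionally weighs per-core reliability and adapts allocations at runtime; this sketch only shows the big-cores-for-HI bias that produces the load-balancing figures reported above.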
Pub Date: 2026-06-01 | Epub Date: 2026-02-26 | DOI: 10.1016/j.sysarc.2026.103755
Evaluating model quantization in a GenAI-enhanced weed detection pipeline
Sourav Modak, Ahmet Oğuz Saltık, Anthony Stein
Journal of Systems Architecture, vol. 175, Article 103755

Deep learning-based weed control systems often struggle with limited training data diversity and constrained computational resources, restricting their effectiveness in real-world deployment. To address these limitations, we introduce a Stable Diffusion-based inpainting framework that progressively augments training datasets in 25% increments, up to 200%, enriching both data volume and variability. We systematically evaluate several state-of-the-art object detection architectures (the large, small, and nano variants of YOLO11 and YOLOv12, along with large RT-DETR models) under three precision settings (FP32, FP16, INT8), using mAP50 and mAP50-95 as evaluation metrics. Experiments on the NVIDIA Jetson Orin Nano, NVIDIA Jetson AGX Orin, and a spo-comm rugged computing unit reveal that quantization consistently reduces latency and memory footprint, with INT8 compression producing the most compact and fastest models. While INT8 often induces accuracy degradation, we show that this loss is significantly reduced by targeted synthetic augmentation. Notably, small YOLO variants trained with augmented data match, and in some cases surpass, the detection performance of their larger baseline counterparts, without added model size or inference cost. Furthermore, using the INT8-quantized Stable Diffusion model for data generation preserves augmentation benefits for the downstream models while minimizing generation overhead. In combination, these contributions establish a novel training and deployment strategy for embedded AI in the context of weed detection, demonstrating that small YOLO models, INT8 quantization, and targeted synthetic augmentation can jointly deliver higher efficiency without sacrificing accuracy.
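As context for the INT8 precision setting evaluated above, here is a minimal sketch of symmetric per-tensor post-training quantization. It is illustrative only; deployment toolchains such as TensorRT use calibrated, typically per-channel schemes, but the accuracy/footprint trade-off has the same origin: values are snapped to 256 levels, with an error of up to half a quantization step.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: one scale maps the
    float range [-max|w|, +max|w|] onto the int8 range [-127, 127]."""
    peak = np.abs(w).max()
    scale = peak / 127.0 if peak > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale
```

Storage drops 4x versus FP32, and the round-trip error is bounded by scale/2 per weight, which is the loss that the targeted synthetic augmentation above helps the detectors absorb.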
Pub Date: 2026-06-01 | Epub Date: 2026-02-23 | DOI: 10.1016/j.sysarc.2026.103752
Predictively controlling the computing continuum with distributed energy-aware orchestration
Pablo Rodríguez, Javier Mateos-Bravo, Sergio Laso, Juan Luis Herrera, Javier Berrocal
Journal of Systems Architecture, vol. 175, Article 103752

Distributing microservices across the Computing Continuum reduces latency and preserves data locality, but introduces management complexity on heterogeneous, resource-constrained edge nodes. Traditional reactive orchestration triggers only after saturation occurs; under bursty or high-density workloads, this latency leads to service degradation, instability, and inefficient energy usage. To address this, the Adaptive Resource-Aware Predictive Orchestrator (ARAPO) couples per-service local forecasting with calibrated node-level aggregation. It employs a dual-threshold policy based on predicted and observed load to trigger migrations, and it maps CPU forecasts to power for energy-aware placement without external instrumentation. ARAPO is evaluated in a realistic hospital reference scenario against a reactive-only baseline. Results demonstrate that the system anticipates saturation, prevents control-plane congestion, and significantly improves stability under oscillating workloads. Overload time drops from 28.4% to 4.5%; consequently, energy usage during overload falls to 14.9% of the reactive baseline. Node-level forecasting achieves R² up to 0.86, and the power model tracks consumption with a mean absolute error as low as 0.40 W. This validates ARAPO's suitability as a lightweight, energy-efficient controller.
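The dual-threshold policy described above can be sketched in a few lines. The threshold values and the exponential-smoothing forecaster are illustrative stand-ins (the paper's per-service forecaster and calibration are more elaborate): the predicted-load threshold fires proactively before saturation, while the observed-load threshold remains as a reactive safety net.

```python
def ema_forecast(history, alpha=0.5):
    """One-step-ahead load forecast via exponential smoothing,
    a simple stand-in for a per-service local forecaster."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def should_migrate(predicted, observed, t_pred=0.8, t_obs=0.9):
    """Dual-threshold migration trigger (thresholds illustrative):
    act early on forecast load, or reactively on observed load."""
    return predicted >= t_pred or observed >= t_obs
```

Setting t_pred below t_obs is what buys the anticipation: migrations start while the node still has headroom to execute them, instead of competing with an already saturated control plane.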
Pub Date: 2026-06-01 | Epub Date: 2026-02-10 | DOI: 10.1016/j.sysarc.2026.103737
Fine-grained sensitive node hardening for graph convolutional network systems
Jing Zhang, Mingzhang Duan, Peiyu Li, Lei Shen, Chang Cai
Journal of Systems Architecture, vol. 175, Article 103737

With the rapid development of the commercial space industry, the scale of graph-structured data is expected to grow significantly. Traditional deep neural networks face challenges in processing such data due to limitations in feature extraction and information propagation, driving research into Graph Convolutional Networks. Although advanced AI edge platforms offer high computational efficiency, they remain vulnerable to single-event upsets and face resource constraints when implementing redundancy for high-reliability designs. This paper presents an underlying circuit partitioning strategy and a node sensitivity analysis framework, where circuit nodes, defined as fine-grained sub-units obtained by further partitioning coarse-grained modules, are mapped to physical locations and the resulting mapping is integrated into fault analysis. Unlike coarse-grained hardening methods that overlook node-level sensitivities, the proposed approach allows precise node-level sensitivity ranking, enabling fine-grained hardening where it is most needed. Experimental results demonstrate that the proposed strategy achieves fault tolerance comparable to full triple modular redundancy, while delivering improvements in resource hardening efficiency of 1.57×, 1.67×, and 1.76×, and improvements in timing hardening efficiency of 1.36×, 1.44×, and 1.52× across the three datasets. Compared to coarse-grained methods, it achieves higher hardening efficiency with only a 1.57× resource overhead and a minimal 15.9% reduction in worst negative slack.
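The selective-hardening idea above (rank nodes by sensitivity, triplicate only the most sensitive ones instead of applying full TMR everywhere) can be illustrated with a small sketch; the ranking criterion and budget here are hypothetical placeholders for the paper's sensitivity analysis.

```python
def tmr_vote(a, b, c):
    """Majority vote over three replicas of a hardened node:
    one faulty replica is outvoted by the other two."""
    return a if a == b or a == c else b

def select_hardened(nodes, sensitivity, budget):
    """Selective hardening sketch: rank nodes by a sensitivity score
    and spend the redundancy budget on the top-ranked ones only."""
    ranked = sorted(nodes, key=lambda n: -sensitivity[n])
    return set(ranked[:budget])
```

Full TMR triplicates every node (3x resources); the selective variant keeps the voter only where an upset would actually propagate, which is where the reported 1.57x resource overhead, versus 3x, comes from.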
Pub Date: 2026-06-01 | Epub Date: 2026-02-12 | DOI: 10.1016/j.sysarc.2026.103739
Flexible Model Inversion Attack with soft biometric attribute reconstruction against face classifiers
Zeping Zhang, Jie Huang, Changhao Ding
Journal of Systems Architecture, vol. 175, Article 103739

Model Inversion Attacks (MIAs) aim to reconstruct private images from their feature vectors. Existing attacks are usually performed by training a reconstruction model over the feature vectors. Because the feature vectors output by different target models have different distributions and dimensions, a new reconstruction model has to be trained for each target model, so the flexibility of the attack is usually limited. This paper aims to improve the flexibility of model inversion attacks against face classifiers. The relationship between training-based MIAs and auto-encoders is studied, and the challenges in improving the flexibility of inversion attacks are analyzed. To improve flexibility, Mapping-MIA is proposed. Mapping-MIA consists of a Data Reconstruction Model that reconstructs faces and their soft biometric attributes and can be reused for future inversion tasks, together with a lightweight Feature Mapping Model that maps feature vectors from the output space of each target model to the latent space of the Data Reconstruction Model. Experimental results show that Mapping-MIA is more flexible across different target models and achieves similar or better results than existing methods. Further, the reconstructed soft biometric attributes reach an average accuracy of 86.63% on the private dataset.
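The architectural idea above (reuse one reconstruction model and train only a small per-target mapping into its latent space) can be illustrated with a linear stand-in. The paper's Feature Mapping Model is a learned network; a least-squares map is used here purely to show why the mapping can be lightweight while the heavy decoder is shared.

```python
import numpy as np

def fit_feature_map(F_target, Z_latent):
    """Illustrative linear feature-mapping: learn W such that
    F_target @ W approximates the reconstruction model's latent
    codes Z_latent, by ordinary least squares. Only this small W
    must be retrained per target model; the decoder is reused."""
    W, *_ = np.linalg.lstsq(F_target, Z_latent, rcond=None)
    return W
```

Fitting a dim_features x dim_latent matrix is far cheaper than retraining a full image-reconstruction network, which is the flexibility gain the abstract describes.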
Pub Date: 2026-06-01 | Epub Date: 2026-03-03 | DOI: 10.1016/j.sysarc.2026.103757
ZTL: A block layer ZNS driver
Jan Sass, André Brinkmann, Matias Bjørling, Xubin He, Reza Salkhordeh
Journal of Systems Architecture, vol. 175, Article 103757

Solid-state drives (SSDs) use NAND flash for data storage. Due to the physical characteristics of NAND, host systems would require extensive modifications to use flash storage directly. Instead, a firmware component of the SSD, the Flash Translation Layer (FTL), enables host systems to use flash storage without modification. However, the FTL performs its own data placement, requiring address translation and garbage collection, which leads to performance unpredictability, performance and hardware overheads, and increased cost for flash storage.

The Zoned Namespaces (ZNS) specification defines a novel interface for the host to interact with flash that avoids the Flash Translation Layer and its shortcomings. Using the ZNS interface requires considerable modification to the host's storage stack, which is why F2FS is the only stable file system with ZNS support today. In this paper, we present the host-side Zoned Translation Layer (ZTL) and extend our previous work on ZTL with additional experiments and implementation details. ZTL provides abstractions and functionalities required by many file systems to support ZNS devices. We demonstrate the feasibility of ZTL by providing the first EXT4 implementation for ZNS devices and by comparing our implementation of ZNS support for F2FS with F2FS's native ZNS support, showing that ZTL decreases implementation overhead for file system developers while sustaining or improving performance.
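The interface constraint that motivates ZTL can be shown with a minimal model of a ZNS zone: writes land append-only at a per-zone write pointer, and space is reclaimed only by resetting the whole zone. This sketch models the interface ZTL builds on, not ZTL's implementation.

```python
class Zone:
    """Minimal model of a ZNS zone: append-only writes at the write
    pointer; a reset empties the zone. In-place updates, which block
    file systems like EXT4 normally issue, are impossible, which is
    the gap a translation layer such as ZTL has to bridge."""

    def __init__(self, capacity):
        self.capacity = capacity   # zone size in blocks
        self.wp = 0                # write pointer (next writable block)
        self.data = []

    def append(self, block):
        if self.wp >= self.capacity:
            raise IOError("zone full: must reset before reuse")
        self.data.append(block)
        self.wp += 1
        return self.wp - 1         # block's offset within the zone

    def reset(self):
        """Reclaim the zone: drop its contents, rewind the pointer."""
        self.data.clear()
        self.wp = 0
```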
Pub Date: 2026-06-01 | Epub Date: 2026-03-04 | DOI: 10.1016/j.sysarc.2026.103759
Derivative-based algorithms for membership, k-non-emptiness, and k-non-empty complement problems in enhanced regular expressions
Mengxi Wang, Chunmei Dong, Weihao Su, Chengyao Peng, Haiming Chen
Journal of Systems Architecture, vol. 175, Article 103759

Enhanced regular expressions (EREs), which extend classical regular expressions with shuffle and counting operators, offer exponentially more succinct representations of regular languages. However, unconstrained EREs lack explicit algorithms for solving the membership, k-non-emptiness, and k-non-empty complement problems. In this paper, we introduce a derivative construction for counting and shuffle operators and formally prove its correctness. We also analyze its time complexity based on a lemma that relates the size of the derivative to that of the original expression. Using this derivative, we propose three algorithms to address the membership, k-non-emptiness, and k-non-empty complement problems for EREs. We conduct experiments demonstrating that these algorithms are both effective and practical. Finally, we validate the correctness of two existing inference algorithms that previously lacked formal guarantees, owing to the absence of practical membership algorithms for unconstrained EREs.
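Derivative-based membership as described above can be sketched for the shuffle operator (counting is omitted for brevity, and this AST encoding with smart constructors is illustrative, not the paper's construction). The key rule is the derivative of a shuffle: d_a(r || s) = (d_a r || s) + (r || d_a s); membership holds iff the derivative by the whole word is nullable.

```python
# Regex AST: ("eps",), ("sym", c), ("alt", r, s), ("cat", r, s),
# ("star", r), ("shuf", r, s); None denotes the empty language.

def nullable(r):
    """Does L(r) contain the empty word?"""
    if r is None: return False
    t = r[0]
    if t == "eps": return True
    if t == "sym": return False
    if t == "alt": return nullable(r[1]) or nullable(r[2])
    if t in ("cat", "shuf"): return nullable(r[1]) and nullable(r[2])
    return True  # "star"

# Smart constructors keep derivatives small by simplifying on the fly.
def alt(r, s):
    if r is None: return s
    if s is None: return r
    return ("alt", r, s)

def cat(r, s):
    if r is None or s is None: return None
    if r[0] == "eps": return s
    if s[0] == "eps": return r
    return ("cat", r, s)

def shuf(r, s):
    if r is None or s is None: return None
    if r[0] == "eps": return s
    if s[0] == "eps": return r
    return ("shuf", r, s)

def deriv(r, a):
    """Brzozowski derivative: d_a(r) accepts w iff r accepts a·w."""
    if r is None or r[0] == "eps": return None
    t = r[0]
    if t == "sym": return ("eps",) if r[1] == a else None
    if t == "alt": return alt(deriv(r[1], a), deriv(r[2], a))
    if t == "cat":
        d = cat(deriv(r[1], a), r[2])
        return alt(d, deriv(r[2], a)) if nullable(r[1]) else d
    if t == "star": return cat(deriv(r[1], a), r)
    # shuffle: derive either operand, keep the other untouched
    return alt(shuf(deriv(r[1], a), r[2]), shuf(r[1], deriv(r[2], a)))

def matches(r, w):
    """Membership by repeated derivatives: w in L(r) iff the
    derivative of r by every letter of w in turn is nullable."""
    for a in w:
        r = deriv(r, a)
    return nullable(r)
```

For example, for the ERE (ab) || c, the accepted words are exactly the interleavings of "ab" with "c": "abc", "acb", and "cab".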
Pub Date : 2026-06-01Epub Date: 2026-03-04DOI: 10.1016/j.sysarc.2026.103762
Zipan Tang , Yixuan Yan , Rongchen Li , Hanze Dong , Bo Sun , Haiming Chen , Hongyu Gao
Real-world regular expressions (regexes) are widely used in practice. However, due to their complex syntax and difficulty in both understanding and writing, automatic synthesis of regexes has been an important research challenge. Existing methods often have limited generalization ability and insufficient support for extended features. To address these challenges, we propose PowerSyn, a framework that leverages large language models (LLMs) and semantic manipulation of sub-expressions. PowerSyn synthesizes regexes from natural language descriptions and examples, and supports extended features. Specifically, our approach includes prompt design for synthesizing regexes with LLMs, as well as a novel algorithm for semantic manipulation of sub-expressions guided by examples and matching relationships. In addition, we explore the ability of LLMs to repair incorrect regexes. The experimental results demonstrate the significant effectiveness of our approach.
{"title":"Multi-modal regular expression synthesis method based on large language models and semantics","authors":"Zipan Tang , Yixuan Yan , Rongchen Li , Hanze Dong , Bo Sun , Haiming Chen , Hongyu Gao","doi":"10.1016/j.sysarc.2026.103762","DOIUrl":"10.1016/j.sysarc.2026.103762","url":null,"abstract":"<div><div>Real-world regular expressions (regexes) are widely used in practice. However, due to their complex syntax and difficulty in both understanding and writing, automatic synthesis of regexes has been an important research challenge. Existing methods often have limited generalization ability and insufficient support for extended features. To address these challenges, we propose PowerSyn, a framework that leverages large language models (LLMs) and semantic manipulation of sub-expressions. PowerSyn synthesizes regexes from natural language descriptions and examples, and supports extended features. Specifically, our approach includes prompt design for synthesizing regexes with LLMs, as well as a novel algorithm for semantic manipulation of sub-expressions guided by examples and matching relationships. In addition, we explore the ability of LLMs to repair incorrect regexes. The experimental results demonstrate the significant effectiveness of our approach.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"175 ","pages":"Article 103762"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
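One building block the abstract above relies on is checking LLM-proposed candidates against user-supplied examples. A generic sketch of that examples-as-oracle step (not PowerSyn's actual semantic-manipulation algorithm; the candidate list and helper names are hypothetical):

```python
import re

# Filter LLM-proposed regex candidates against positive/negative examples:
# keep a candidate only if it fully matches every positive example and
# rejects every negative one.

def consistent(pattern, positives, negatives):
    """True iff pattern accepts all positives and rejects all negatives."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return False  # syntactically invalid candidate from the LLM
    return (all(rx.fullmatch(p) for p in positives)
            and not any(rx.fullmatch(n) for n in negatives))

def select(candidates, positives, negatives):
    """Return the first example-consistent candidate, or None."""
    for pat in candidates:
        if consistent(pat, positives, negatives):
            return pat
    return None

# Hypothetical candidates for a "three digits, dash, four digits" task
cands = [r"\d+", r"\d{3}-\d{4}", r"[0-9-]+"]
pos = ["555-0192", "123-4567"]
neg = ["5550192", "abc-defg"]
```

Here `select(cands, pos, neg)` discards `\d+` (it cannot match the dash in the positives) and keeps `\d{3}-\d{4}`; a failed `select` is the natural trigger for a repair round with the LLM.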
Pub Date: 2026-06-01 | Epub Date: 2026-02-19 | DOI: 10.1016/j.sysarc.2026.103749
Richard Yang , Heather D. Orser , Kip A. Ludwig , Brandon S. Coventry
Digital implementations of the discrete Fourier transform (DFT) are a mainstay in feature assessment of recorded biopotentials, particularly in the quantification of biomarkers of neurological disease state for adaptive deep brain stimulation. Fast Fourier transform (FFT) algorithms and architectures present a substantial energy demand from onboard batteries in implantable medical devices, necessitating the development of ultra-low-energy Fourier transform methods in resource-constrained environments. Numerous FFT architectures aim to optimize energy and resource consumption through computational efficiency; however, prioritizing logic complexity reduction at the expense of additional computations can be equally or more effective. This paper introduces a minimal-architecture single-delay feedback discrete Fourier transform (mSDF-DFT) for use in ultra-low-energy field-programmable gate array applications and demonstrates energy and power improvements over benchmark low-energy DFT and FFT methods. Across the parameter set, we observed an 11.1% median resource usage reduction and a 5.0% median energy reduction compared to a gold-standard SDF-FFT algorithm, and a 38.1% median resource reduction and an 8.8% median energy reduction compared to the Goertzel algorithm. While designed for use in closed-loop deep brain stimulation and medical device implementations, the mSDF-DFT is also easily extendable to any ultra-low-energy embedded application.
{"title":"mSDF-DFT: An ultra-low energy discrete Fourier transform architecture for closed-loop neural sensing","authors":"Richard Yang , Heather D. Orser , Kip A. Ludwig , Brandon S. Coventry","doi":"10.1016/j.sysarc.2026.103749","DOIUrl":"10.1016/j.sysarc.2026.103749","url":null,"abstract":"<div><div>Digital implementations of the discrete Fourier transform (DFT) are a mainstay in feature assessment of recorded biopotentials, particularly in the quantification of biomarkers of neurological disease state for adaptive deep brain stimulation. Fast Fourier transform (FFT) algorithms and architectures present a substantial energy demand from onboard batteries in implantable medical devices, necessitating the development of ultra-low energy Fourier transform methods in resource-constrained environments. Numerous FFT architectures aim to optimize energy and resource consumption through computational efficiency; however, prioritizing logic complexity reduction at the expense of additional computations can be equally or more effective. This paper introduces a minimal-architecture single-delay feedback discrete Fourier transform (mSDF-DFT) for use in ultra-low-energy field-programmable gate array applications and demonstrates energy and power improvements over benchmark low-energy DFT and FFT methods. Across the parameter set, we observed 11.1% median resource usage reduction and 5.0% median energy reduction when compared to a gold standard SDF-FFT algorithm and 38.1% median resource reduction and 8.8% median energy reduction when compared to the Goertzel Algorithm. While designed for use in closed-loop deep brain stimulation and medical device implementations, the mSDF-DFT is also easily extendable to any ultra-low-energy embedded application.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"175 ","pages":"Article 103749"},"PeriodicalIF":4.1,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
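The Goertzel algorithm used as a comparison baseline in the record above computes a single DFT bin with a cheap per-sample recurrence, which is exactly the shape of a single-biomarker power estimate. A reference sketch of that classical baseline (not the mSDF-DFT architecture itself):

```python
import math

# Goertzel algorithm: squared magnitude of one DFT bin via a second-order
# recurrence, avoiding a full FFT when only a few frequencies are needed.
# Classical-baseline sketch for context, not the paper's mSDF-DFT design.

def goertzel_power(x, k):
    """Return |X[k]|^2 for the N-point DFT of the real sequence x."""
    n = len(x)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = s2 = 0.0
    for sample in x:
        s0 = sample + coeff * s1 - s2  # one multiply-accumulate per sample
        s2, s1 = s1, s0
    # Recover the bin's squared magnitude from the final two state values
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

# A pure cosine at bin 5 of a 64-point frame concentrates its energy there:
# |X[5]| = N/2 = 32, so the power at bin 5 is 1024 and ~0 elsewhere.
N, k = 64, 5
tone = [math.cos(2.0 * math.pi * k * t / N) for t in range(N)]
```

Only one real multiply-accumulate runs per sample inside the loop, which is why Goertzel is the standard low-resource yardstick the mSDF-DFT is measured against.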