Efficient column-wise N:M pruning on RISC-V CPU
Pub Date: 2025-12-26 | DOI: 10.1016/j.sysarc.2025.103667
Chi-Wei Chu, Ding-Yong Hong, Jan-Jan Wu
In deep learning frameworks, weight pruning is a widely used technique for improving computational efficiency by reducing the size of large models. This is especially critical for convolutional operators, which often act as performance bottlenecks in convolutional neural networks (CNNs). However, the effectiveness of pruning heavily depends on how it is implemented, as different methods can significantly impact both computational performance and memory footprint. In this work, we propose a column-wise N:M pruning strategy applied at the tile level and modify XNNPACK to enable efficient execution of pruned models on the RISC-V vector architecture. Additionally, we propose fusing the operations of im2col and data packing to minimize redundant memory accesses and memory overhead. To further optimize performance, we incorporate AITemplate’s profiling technique to identify the optimal implementation for each convolutional operator. Our proposed approach effectively increases ResNet inference throughput by as much as 4×, and preserves ImageNet top-1 accuracy within 2.1% of the dense baseline. The code of our framework is publicly available at https://github.com/wewe5215/AI_template_RVV_backend
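As a rough illustration of the pruning pattern described above (not the authors' XNNPACK/RISC-V implementation), the sketch below applies a column-wise N:M constraint to a weight tile: within every group of M consecutive entries along a column, only the N largest-magnitude weights survive. The tile shape and the choice N = 2, M = 4 are assumed example values.

```python
import numpy as np

def prune_column_nm(tile: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m consecutive
    entries along each column of a 2-D weight tile (column-wise N:M sparsity).
    Assumes the number of rows is a multiple of m."""
    rows, cols = tile.shape
    assert rows % m == 0, "tile height must be a multiple of m"
    pruned = tile.copy()
    for c in range(cols):
        col = pruned[:, c].reshape(-1, m)          # groups of m along the column
        # indices of the (m - n) smallest-magnitude entries in each group
        drop = np.argsort(np.abs(col), axis=1)[:, : m - n]
        np.put_along_axis(col, drop, 0.0, axis=1)  # zero them out
        pruned[:, c] = col.reshape(-1)             # write the group back
    return pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 4)).astype(np.float32)  # toy 8x4 weight tile
    print(prune_column_nm(w, n=2, m=4))
```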
{"title":"Efficient column-wise N:M pruning on RISC-V CPU","authors":"Chi-Wei Chu, Ding-Yong Hong, Jan-Jan Wu","doi":"10.1016/j.sysarc.2025.103667","DOIUrl":"10.1016/j.sysarc.2025.103667","url":null,"abstract":"<div><div>In deep learning frameworks, weight pruning is a widely used technique for improving computational efficiency by reducing the size of large models. This is especially critical for convolutional operators, which often act as performance bottlenecks in convolutional neural networks (CNNs). However, the effectiveness of pruning heavily depends on how it is implemented, as different methods can significantly impact both computational performance and memory footprint. In this work, we propose a column-wise N:M pruning strategy applied at the tile level and modify XNNPACK to enable efficient execution of pruned models on the RISC-V vector architecture. Additionally, we propose fusing the operations of im2col and data packing to minimize redundant memory accesses and memory overhead. To further optimize performance, we incorporate AITemplate’s profiling technique to identify the optimal implementation for each convolutional operator. Our proposed approach effectively increases ResNet inference throughput by as much as <span><math><mrow><mn>4</mn><mo>×</mo></mrow></math></span>, and preserves ImageNet top-1 accuracy within 2.1% of the dense baseline. The code of our framework is publicly available at <span><span>https://github.com/wewe5215/AI_template_RVV_backend</span><svg><path></path></svg></span></div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103667"},"PeriodicalIF":4.1,"publicationDate":"2025-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MxGPU: Efficient and safe communication between GPGPU applications in an OS-controlled GPGPU multiplexing environment
Pub Date: 2025-12-24 | DOI: 10.1016/j.sysarc.2025.103669
Marcel Lütke Dreimann, Olaf Spinczyk
With the growing demand for artificial intelligence and other data-intensive applications, the demand for graphics processing units (GPUs) has also increased. Even though there are many approaches to multiplexing GPUs, none of the approaches known to us enable the operating system to coherently integrate GPU resources alongside CPU resources into holistic resource management. Due to the history of GPUs, GPU drivers are still a large, isolated part of the driver stack of operating systems. This paper conducts a case study on what a multiplexing solution for GPGPUs could look like in which the OS is able to define scheduling policies for GPGPU tasks and manage GPU memory. OS-controlled GPU memory management can be especially helpful for efficient and safe communication between GPGPU applications. We discuss and evaluate the architecture of MxGPU, which offers software-based multiplexing of integrated Intel GPUs. MxGPU has a tiny code base, which is a precondition for formal verification approaches and usage in safety-critical environments. Experiments with our prototype show that MxGPU can grant the operating system control over GPU resources while allowing more GPU sessions. Furthermore, MxGPU executes GPGPU tasks with lower latency than Linux and enables efficient and safe communication between GPU applications.
{"title":"MxGPU: Efficient and safe communication between GPGPU applications in an OS-controlled GPGPU multiplexing environment","authors":"Marcel Lütke Dreimann, Olaf Spinczyk","doi":"10.1016/j.sysarc.2025.103669","DOIUrl":"10.1016/j.sysarc.2025.103669","url":null,"abstract":"<div><div>With the growing demand for artificial intelligence and other data-intensive applications, the demand for graphics processing units (GPUs) has also increased. Even though there are many approaches on multiplexing GPUs, none of the approaches known to us enable the operating system to coherently integrate GPU resources alongside CPU resources into a holistic resource management. Due to the history of GPUs, GPU drivers are still a large, isolated part within the driver stack of operating systems. This paper aims to conduct a case study on how a multiplexing solution for GPGPUs could look like, where the OS is able to define scheduling policies for GPGPU tasks and manage GPU memory. OS-controlled GPU memory management can especially be helpful for efficient and safe communication between GPGPU applications. We will discuss and evaluate the architecture of <span>MxGPU</span>, which offers software-based multiplexing of integrated Intel GPUs. <span>MxGPU</span> has a tiny code base, which is a precondition for formal verification approaches and usage in safety-critical environments. Experiments with our prototype show that <span>MxGPU</span> can grant the operating system control over GPU resources while allowing more GPU sessions. Furthermore, <span>MxGPU</span> allows for execution of GPGPU tasks with less latency compared to Linux and enables efficient and safe communication between GPU applications.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103669"},"PeriodicalIF":4.1,"publicationDate":"2025-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145841942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thwarting gradient inversion in federated learning via generative shadow mapping defense
Pub Date: 2025-12-22 | DOI: 10.1016/j.sysarc.2025.103671
Hui Zhou, Yuling Chen, Zheng Qin, Xin Deng, Ziyu Peng
Federated learning (FL) has garnered significant attention in the Artificial Intelligence of Things (AIoT) domain. It enables collaborative learning across distributed, privacy-sensitive devices without compromising their local data. However, existing research indicates that adversaries can still reconstruct the raw data from the observed gradient, resulting in a privacy breach. To further strengthen privacy in FL, various defense measures have been proposed, ranging from encryption-based and perturbation-based methods to advanced adaptive strategies. However, nearly all such defenses are applied directly to raw data or gradients, where the private information inherently resides. This intrinsic presence of sensitive data inevitably leaves FL vulnerable to privacy leakage. Thus, a new defense that can “erase” private information is urgently needed. In this paper, we propose Shade, a shadow mapping defense framework against gradient inversion attacks using generative models in FL. We implement two instances of manifold defense methods based on generative adversarial networks and diffusion models, ShadeGAN and ShadeDiff. In particular, we first generate alternative shadow data to participate in model training. Subsequently, we construct a surrogate model to replace the raw model, eliminating the memory of the raw model. Finally, an optional gradient protection mechanism is provided, which operates by mapping raw gradients to their shadow counterparts. Extensive experiments demonstrate that our scheme can prevent adversaries from reconstructing raw data, effectively reducing the risk of FL privacy disclosure.
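The sketch below is only a minimal, generic illustration of the shadow-mapping idea: the value shared during federated training is a gradient computed on generated stand-in data rather than on the private batch, so the raw samples never touch what an attacker could invert. The linear model, the shapes, and the noise-based stand-in "generator" are assumptions for illustration; Shade's GAN/diffusion generators and its surrogate-model construction are not modeled here.

```python
import numpy as np

def raw_gradient(w, x, y):
    """Least-squares gradient on the private batch (what gradient inversion targets)."""
    return 2.0 * x.T @ (x @ w - y) / len(x)

def shadow_gradient(w, shadow_x, shadow_y):
    """Gradient computed on generated 'shadow' samples instead of raw data,
    so the shared update never depends directly on the private batch."""
    return 2.0 * shadow_x.T @ (shadow_x @ w - shadow_y) / len(shadow_x)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.normal(size=(5, 1))
    x, y = rng.normal(size=(32, 5)), rng.normal(size=(32, 1))      # private batch
    # Stand-in for a generative model: here the shadow batch is just noise with
    # matching shapes; Shade would produce it with a GAN or a diffusion model.
    sx, sy = rng.normal(size=(32, 5)), rng.normal(size=(32, 1))
    print(np.linalg.norm(raw_gradient(w, x, y) - shadow_gradient(w, sx, sy)))
```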
{"title":"Thwarting gradient inversion in federated learning via generative shadow mapping defense","authors":"Hui Zhou , Yuling Chen , Zheng Qin , Xin Deng , Ziyu Peng","doi":"10.1016/j.sysarc.2025.103671","DOIUrl":"10.1016/j.sysarc.2025.103671","url":null,"abstract":"<div><div>Federated learning (FL) has garnered significant attention in the Artificial Intelligence of Things (AIoT) domain. It enables collaborative learning across distributed, privacy-sensitive devices without compromising their local data. However, existing research indicates that adversaries can still reconstruct the raw data by the observed gradient, resulting in a privacy breach. To further strengthen privacy in FL, various defense measures have been proposed, ranging from encryption-based and perturbation-based methods to advanced adaptive strategies. However, nearly all such defenses are applied directly to raw data or gradients, where the private information inherently resides. This intrinsic presence of sensitive data inevitably leaves FL vulnerable to privacy leakage. Thus, a new defense that can <em>“erase”</em> private information is urgently needed. In this paper, we propose <em>Shade</em>, a shadow mapping defense framework against gradient inversion attack using generative models in FL. We implement two instances of manifold defense methods based on a generative adversarial networks and diffusion models, <em>ShadeGAN</em> and <em>ShadeDiff</em>. In particular, we first generate alternative shadow data to involve in model training. Subsequently, we construct surrogate model to replace the raw model, eliminating the memory of the raw model. Finally, an optional gradient protection mechanism is provided, which operates by mapping raw gradients to their shadow counterparts. Extensive experiment demonstrates that our scheme can prevent adversaries from reconstructing raw data, effectively reducing the risk of FL privacy disclosure.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103671"},"PeriodicalIF":4.1,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145841939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ARAS: Adaptive low-cost ReRAM-based accelerator for DNNs
Pub Date: 2025-12-20 | DOI: 10.1016/j.sysarc.2025.103668
Mohammad Sabri, Marc Riera, Antonio González
Processing Using Memory (PUM) accelerators have the potential to perform Deep Neural Network (DNN) inference by using arrays of memory cells as computation engines. Among various memory technologies, ReRAM crossbars show promising performance in computing dot-product operations in the analog domain. Nevertheless, the expensive writing procedure of ReRAM cells has led researchers to design accelerators whose crossbars have enough capacity to store the full DNN. Given the tremendous and continuous increase in DNN model sizes, this approach is unfeasible for some networks, or inefficient due to the huge hardware requirements. Those accelerators lack the flexibility to adapt to any given DNN model, facing an adaptability challenge.
To address this issue, we introduce ARAS, a cost-effective ReRAM-based accelerator that employs an offline scheduler to adapt different DNNs to the resource-limited hardware. ARAS also overlaps the computation of a layer with the weight writing of several layers to mitigate the high writing latency of ReRAM. Furthermore, ARAS introduces three optimizations aimed at reducing the energy overheads of writing in ReRAM. Our key optimization capitalizes on the observation that DNN weights can be re-encoded to augment their similarity between layers, increasing the amount of bitwise values that are equal or similar when overwriting ReRAM cells and, hence, reducing the amount of energy required to update the cells. Overall, ARAS greatly reduces the ReRAM writing activity. We evaluate ARAS on a popular set of DNNs. ARAS provides up to 2.2× speedup and 45% energy savings over a baseline PUM accelerator without any optimization. Compared to a TPU-like accelerator, ARAS provides up to 1.5× speedup and 62% energy savings.
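As a hedged illustration of why re-encoding can pay off, the sketch below counts how many multi-level ReRAM cells would have to be rewritten when one layer's quantized weights overwrite another's, and compares a plain binary encoding with one alternative encoding (Gray code, chosen here purely as an example; it is not ARAS's scheme). The 4-bit weights and 2-bit cells are assumed values.

```python
import numpy as np

def to_cells(vals, bits_per_cell=2, n_cells=2):
    """Split 4-bit quantized weights into multi-level ReRAM cell values."""
    cells = np.empty(vals.shape + (n_cells,), dtype=np.int64)
    for i in range(n_cells):
        cells[..., i] = (vals >> (i * bits_per_cell)) & ((1 << bits_per_cell) - 1)
    return cells

def rewrite_cost(stored, incoming):
    """Number of cells whose level must change when overwriting a crossbar
    (a rough proxy for ReRAM write energy)."""
    return int(np.count_nonzero(to_cells(stored) != to_cells(incoming)))

def gray(vals):
    """One possible re-encoding (Gray code). ARAS picks encodings that raise
    similarity between consecutive layers; this is only an illustration."""
    return vals ^ (vals >> 1)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    layer_a = rng.integers(0, 16, size=1024)   # 4-bit weights of layer i (stored)
    layer_b = rng.integers(0, 16, size=1024)   # layer i+1, written over the same cells
    print("plain encoding:", rewrite_cost(layer_a, layer_b))
    print("gray encoding :", rewrite_cost(gray(layer_a), gray(layer_b)))
```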
{"title":"ARAS: Adaptive low-cost ReRAM-based accelerator for DNNs","authors":"Mohammad Sabri, Marc Riera, Antonio González","doi":"10.1016/j.sysarc.2025.103668","DOIUrl":"10.1016/j.sysarc.2025.103668","url":null,"abstract":"<div><div>Processing Using Memory (PUM) accelerators have the potential to perform Deep Neural Network (DNN) inference by using arrays of memory cells as computation engines. Among various memory technologies, ReRAM crossbars show promising performance in computing dot-product operations in the analog domain. Nevertheless, the expensive writing procedure of ReRAM cells has led researchers to design accelerators whose crossbars have enough capacity to store the full DNN. Given the tremendous and continuous increase in DNN model sizes, this approach is unfeasible for some networks, or inefficient due to the huge hardware requirements. Those accelerators lack the flexibility to adapt to any given DNN model, facing an <em>adaptability</em> challenge.</div><div>To address this issue we introduce ARAS, a cost-effective ReRAM-based accelerator that employs an offline scheduler to adapt different DNNs to the resource-limited hardware. ARAS also overlaps the computation of a layer with the weight writing of several layers to mitigate the high writing latency of ReRAM. Furthermore, ARAS introduces three optimizations aimed at reducing the energy overheads of writing in ReRAM. Our key optimization capitalizes on the observation that DNN weights can be re-encoded to augment their similarity between layers, increasing the amount of bitwise values that are equal or similar when overwriting ReRAM cells and, hence, reducing the amount of energy required to update the cells. Overall, ARAS greatly reduces the ReRAM writing activity. We evaluate ARAS on a popular set of DNNs. ARAS provides up to <span><math><mrow><mn>2</mn><mo>.</mo><mn>2</mn><mo>×</mo></mrow></math></span> speedup and 45% energy savings over a baseline PUM accelerator without any optimization. Compared to a TPU-like accelerator, ARAS provides up to <span><math><mrow><mn>1</mn><mo>.</mo><mn>5</mn><mo>×</mo></mrow></math></span> speedup and 62% energy savings.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103668"},"PeriodicalIF":4.1,"publicationDate":"2025-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective reinforcement learning-based dynamic flexible job shop scheduling using two-stage dispatching
Pub Date: 2025-12-19 | DOI: 10.1016/j.sysarc.2025.103664
Jiepin Ding, Jun Xia, Yutong Ye, Mingsong Chen
Deep Reinforcement Learning (DRL) has been recognized as a promising means for solving the Dynamic Flexible Job Shop Scheduling Problem (DFJSP), where the involved jobs have distinct start times and due dates. However, due to the improper DRL modeling of scheduling components, it is hard to guarantee the quality (e.g., makespan, resource utilization) of job-to-machine dispatching solutions for DFJSP. This is mainly because (i) most existing DRL-based methods design actions as composite rules by combining the processes of operation sequencing and machine assignment, which inevitably limits their adaptability to ever-changing scheduling scenarios; and (ii) without considering knowledge sharing among DRL network nodes, the learned policy networks with bounded sizes cannot be applied to complex large-scale scheduling problems. To address this problem, this paper introduces a novel DRL-based two-stage dispatching method that can effectively solve the DFJSP to achieve scheduling solutions of better quality. In our approach, the first stage utilizes a graph neural network-based policy network to facilitate optimal operation selection at each dispatching point. Since the policy network is size-agnostic and can share knowledge among DRL network nodes through graph embedding, it can handle DFJSP instances of varying scales. For the second stage, by decoupling the dependencies between operations and machines, we propose an effective machine selection heuristic that can derive more dispatching rules to improve the adaptability of DRL to various complex dynamic scheduling scenarios. Comprehensive experimental results demonstrate the superiority of our approach over state-of-the-art methods from the perspectives of both scheduling solution quality and the adaptability of the learned DRL models.
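The toy skeleton below mirrors the two-stage structure described above, with simple stand-in rules: stage 1 picks the next ready operation (where the paper uses a size-agnostic GNN policy) and stage 2 assigns a machine with an earliest-completion heuristic. The job data, rules, and cost model are assumptions for illustration only.

```python
def dispatch(jobs, machine_free):
    """Two-stage dispatching skeleton for a flexible job shop.
    jobs: {job: [(processing_time, eligible_machines), ...]} in operation order.
    machine_free: {machine: time it becomes free}.
    Stage 1 selects the next operation (a simple rule stands in for the paper's
    GNN policy); stage 2 assigns a machine with a heuristic."""
    job_ready = {j: 0.0 for j in jobs}   # time each job's next operation may start
    next_op = {j: 0 for j in jobs}
    schedule = []
    while any(next_op[j] < len(ops) for j, ops in jobs.items()):
        ready = [j for j, ops in jobs.items() if next_op[j] < len(ops)]
        # Stage 1: operation selection (stand-in: job whose next op is ready earliest)
        j = min(ready, key=lambda jb: job_ready[jb])
        dur, eligible = jobs[j][next_op[j]]
        # Stage 2: machine selection heuristic (earliest completion time)
        m = min(eligible, key=lambda mc: max(machine_free[mc], job_ready[j]) + dur)
        start = max(machine_free[m], job_ready[j])
        end = start + dur
        machine_free[m], job_ready[j] = end, end
        schedule.append((j, next_op[j], m, start, end))
        next_op[j] += 1
    return schedule

if __name__ == "__main__":
    jobs = {"J1": [(3, ["M1", "M2"]), (2, ["M2"])],
            "J2": [(4, ["M1"]), (1, ["M1", "M2"])]}
    for row in dispatch(jobs, {"M1": 0.0, "M2": 0.0}):
        print(row)
```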
{"title":"Effective reinforcement learning-based dynamic flexible job shop scheduling using two-stage dispatching","authors":"Jiepin Ding , Jun Xia , Yutong Ye, Mingsong Chen","doi":"10.1016/j.sysarc.2025.103664","DOIUrl":"10.1016/j.sysarc.2025.103664","url":null,"abstract":"<div><div>Deep Reinforcement Learning (DRL) has been recognized as a promising means for solving the Dynamic Flexible Job Shop Scheduling Problem (DFJSP), where involved jobs have both distinct start time and due dates. However, due to the improper DRL modeling of scheduling components, it is hard to guarantee the quality (e.g., makespan, resource utilization) of job-to-machine dispatching solutions for DFJSP. This is mainly because (i) most existing DRL-based methods design actions as composite rules by combining both the processes of operation sequencing and machine assignment together, which will inevitably limit their adaptability to ever-changing scheduling scenarios; and (ii) without considering knowledge sharing among DRL network nodes, the learned policy networks with bounded sizes cannot be applied to complex large-scale scheduling problems. To address this problem, this paper introduces a novel DRL-based two-stage dispatching method that can effectively solve the DFJSP to achieve scheduling solutions of better quality. In our approach, the first stage utilizes a graph neural network-based policy network to facilitate optimal operation selection at each dispatching point. Since the policy network is size-agnostic and can share knowledge among DRL network nodes through graph embedding, it can handle DFJSP instances of varying scales. For the second stage, by decoupling the dependencies between operations and machines, we propose an effective machine selection heuristic that can derive more dispatching rules to improve the adaptability of DRL to various complex dynamic scheduling scenarios. Comprehensive experimental results demonstrate the superiority of our approach over state-of-the-art methods from both the perspective of scheduling solution quality and the adaptability of learned DRL models.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103664"},"PeriodicalIF":4.1,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145841941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WASP: Stack protection for WebAssembly
Pub Date: 2025-12-17 | DOI: 10.1016/j.sysarc.2025.103666
Ewan Massey, Pierre Olivier
WebAssembly is a binary executable format designed as a compilation target enabling high-level language code to be run natively in web browsers, JavaScript runtimes, and standalone interpreters. Previous work has highlighted WebAssembly’s vulnerability to traditional memory exploits, such as stack smashing (stack-based buffer overflows), when compiled from memory-unsafe languages. Such vulnerabilities are used as components in impactful end-to-end exploits; hence, mitigations against memory exploits, such as stack canaries, need to be designed and implemented for WebAssembly. We present WASP, an implementation of stack-based buffer overflow protection using stack canaries within Emscripten, the leading C and C++ to WebAssembly compiler. Further, we provide an extension to the standard stack smashing protection design, offering extra security against canary leak attacks by randomizing the canary on a per-function-call basis. We verify WASP’s effectiveness against proof-of-concept exploits. Evaluation results show that the overheads brought by WASP on execution time, executable binary size, and compilation workflow are negligible to low on all platforms considered: the Chromium web browser, the Node.js JavaScript runtime, as well as the standalone WebAssembly runtimes Wasmer and WAVM.
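The toy simulation below illustrates only the canary protocol itself: a fresh random word is placed right after a buffer in the frame and re-checked before "returning", so a write that overruns the buffer is caught. It does not model WebAssembly, Emscripten, or WASP's code generation; the frame layout and sizes are assumptions.

```python
import os

def call_with_canary(frame: bytearray, buf_len: int, payload: bytes) -> None:
    """Simulated function call with a stack canary: a per-call random word sits
    between the local buffer and the rest of the frame; the epilogue re-checks
    it so a buffer overrun is detected before the function returns. WASP does
    this per call on WebAssembly's linear-memory stack; this is an illustration."""
    canary = os.urandom(8)                     # fresh random canary for this call
    frame[buf_len:buf_len + 8] = canary        # placed just past the buffer
    frame[0:len(payload)] = payload            # unchecked copy into the buffer
    if frame[buf_len:buf_len + 8] != canary:   # epilogue check
        raise RuntimeError("stack smashing detected")

if __name__ == "__main__":
    call_with_canary(bytearray(64), 16, b"A" * 8)        # fits: no error
    try:
        call_with_canary(bytearray(64), 16, b"A" * 32)   # overflows: caught
    except RuntimeError as e:
        print(e)
```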
{"title":"WASP: Stack protection for WebAssembly","authors":"Ewan Massey, Pierre Olivier","doi":"10.1016/j.sysarc.2025.103666","DOIUrl":"10.1016/j.sysarc.2025.103666","url":null,"abstract":"<div><div>WebAssembly is a binary executable format designed as a compilation target enabling high-level language code to be run natively in web browsers, JavaScript runtimes, and standalone interpreters. Previous work has highlighted WebAssembly’s vulnerability to traditional memory exploits, such as stack smashing (stack-based buffer overflows), when compiled from memory-unsafe languages. Such vulnerabilities are used as a component in impactful end-to-end exploits, hence the design and implementation in WebAssembly of mitigations against memory exploits, such as stack canaries, is needed. We present WASP, an implementation of stack-based buffer overflow protection using stack canaries within Emscripten, the leading C and C<span>++</span> to WebAssembly compiler. Further, we provide an extension to the standard stack smashing protection design, offering extra security against canary leak attacks by randomizing the canary on a per-function call basis. We verify WASP’s effectiveness against proof-of-concept exploits. Evaluation results show that the overheads brought by WASP on execution time, executable binary size, and compilation workflow are negligible to low in all platforms considered: the Chromium web browser, the Node.js JavaScript runtime, as well as the standalone WebAssembly runtimes Wasmer and WAVM.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103666"},"PeriodicalIF":4.1,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145841940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ECDPA: An enhanced concurrent differentially private algorithm in electric vehicles for parallel queries
Pub Date: 2025-12-16 | DOI: 10.1016/j.sysarc.2025.103665
Mohsin Ali, Muneeb Ul Hassan, Pei-Wei Tsai, Jinjun Chen
As the adoption of electric vehicles (EVs) has skyrocketed in the past few decades, data-dependent services integrated into charging stations (CS) raise additional alarming concerns. The threat of adversaries exploiting the privacy of individuals has been addressed extensively by deploying techniques such as differential privacy (DP) and encryption-based approaches. However, these previous approaches work effectively with sequential or single queries, but are not useful for parallel queries. This paper proposes a novel and interactive approach termed CDP-INT, which aims to handle multiple queries targeting the same dataset while precluding exploitation of the user's sensitive information. The proposed mechanism is effectively tailored for EVs and CS, in which the total privacy budget ϵ is distributed among a number of parallel queries. This research ensures robust protection of privacy in response to multiple queries, maintaining the optimum trade-off between utility and privacy by implementing dynamic allocation of ϵ in a concurrent model. Furthermore, the experimental evaluation showcases the efficacy of CDP-INT in comparison to other approaches that handle queries sequentially, and confirms that CDP-INT is a viable solution offering privacy for sensitive information in response to multiple queries.
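As a generic illustration of splitting a privacy budget across parallel queries, the sketch below answers k queries with the standard Laplace mechanism after dividing a total ϵ equally among them; CDP-INT's dynamic allocation is not modeled, and the example query values and sensitivities are assumptions.

```python
import numpy as np

def answer_parallel_queries(true_values, epsilon_total=1.0, sensitivities=None):
    """Split a total privacy budget across k queries answered concurrently and
    perturb each answer with the Laplace mechanism. Equal splitting is shown
    here; a dynamic allocator would assign each query its own share of ϵ."""
    k = len(true_values)
    sensitivities = sensitivities or [1.0] * k
    eps_each = epsilon_total / k                      # per-query budget
    rng = np.random.default_rng()
    return [v + rng.laplace(scale=s / eps_each)       # noise scale = sensitivity / ϵ_i
            for v, s in zip(true_values, sensitivities)]

if __name__ == "__main__":
    # e.g. three aggregate queries on charging-station data (assumed values)
    print(answer_parallel_queries([120.0, 45.5, 7.0], epsilon_total=0.5))
```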
{"title":"ECDPA: An enhanced concurrent differentially private algorithm in electric vehicles for parallel queries","authors":"Mohsin Ali , Muneeb Ul Hassan , Pei-Wei Tsai , Jinjun Chen","doi":"10.1016/j.sysarc.2025.103665","DOIUrl":"10.1016/j.sysarc.2025.103665","url":null,"abstract":"<div><div>As the adoption of electric vehicles (EVs) has skyrocketed in the past few decades, data-dependent services integrated into charging stations (CS) raise additional alarming concerns. Adversaries exploiting the privacy of individuals have been taken care of extensively by deploying techniques such as differential privacy (DP) and encryption-based approaches. However, these previous approaches worked effectively with sequential or single query, but were not useful for parallel queries. This paper proposed a novel and interactive approach termed <em>CDP-INT</em>, which aimed to tackle the multiple queries targeted at the same dataset, precluding exploitation of sensitive information of the user. This proposed mechanism is effectively tailored for EVs and CS in which the total privacy budget <span><math><mi>ϵ</mi></math></span> is distributed among a number of parallel queries. This research ensures the robust protection of privacy in response to multiple queries, maintaining the optimum trade-off between utility and privacy by implementing dynamic allocation of the <span><math><mi>ϵ</mi></math></span> in a concurrent model. Furthermore, the experimental evaluation section showcased the efficacy of CDP-INT in comparison to other approaches working on the sequential mechanism to tackle the queries. Thus, the experimental evaluation has also vouched that CDP-INT is a viable solution offering privacy to sensitive information in response to multiple queries.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103665"},"PeriodicalIF":4.1,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145760624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GainP: A Gain Cell Embedded DRAM-based associative in-memory processor
Pub Date: 2025-12-15 | DOI: 10.1016/j.sysarc.2025.103632
Yaniv Levi, Odem Harel, Adam Teman, Leonid Yavits
Associative processors (APs) are massively-parallel in-memory SIMD accelerators. While fairly well-known, APs have been revisited in recent years due to the proliferation of data-centric computing, and specifically, processing using memory. APs are based on Content Addressable Memory and utilize its unique ability to simultaneously search the entire memory content for a query pattern to implement massively parallel computations in memory. Several memory infrastructures have been considered for associative processing, including static CMOS, resistive, magnetoresistive, ferroelectric and even NAND flash memories. While all of these have certain merits (speed and low energy consumption for static CMOS, density for resistive and ferroelectric memories), they also face challenges (low density for static CMOS and magnetoresistive, limited write endurance and high write energy for resistive and ferroelectric memories), which limit the scalability and usefulness of APs. This work introduces GainP, an AP based on silicon-proven Gain Cell embedded DRAM (GCeDRAM). The latter combines relatively high density (compared to static CMOS memory) with low energy, high speed, practically unlimited endurance and low production costs (compared to emerging memory technologies). Using sparse-by-sparse matrix multiplication, we show that GainP outperforms high-performance CPU and GPU by 825× and 41×. We also show that GainP outperforms state-of-the-art processing-in-memory sparse matrix multiplication accelerators GAS, OuterSPACE and MatRaptor by 128×, 125× and 16×, respectively, and provides average energy benefits of 96×, 95× and 15×, respectively.
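The snippet below illustrates the basic associative-processing primitive the abstract relies on: a masked compare of a key against all memory rows in parallel, followed by a masked write to the matching rows. NumPy stands in for the parallel CAM hardware; the word width and the example key/masks are assumptions, and GainP's gain-cell circuitry and bit-serial arithmetic are not modeled.

```python
import numpy as np

def ap_compare_write(mem, key, mask, write_val, write_mask):
    """One associative-processing cycle over a bit-matrix 'mem' (rows = words):
    compare every row against 'key' on the bit columns selected by 'mask', then
    write 'write_val' into the bit columns selected by 'write_mask' of the
    matching rows. A CAM does the search over all rows simultaneously."""
    match = np.all((mem & mask) == (key & mask), axis=1)          # parallel search
    mem[match] = (mem[match] & ~write_mask) | (write_val & write_mask)
    return match

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    mem  = rng.integers(0, 2, size=(8, 8), dtype=np.uint8)  # 8 words x 8 bit columns
    key  = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=np.uint8)
    mask = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=np.uint8)  # compare 2 MSBs only
    wv   = np.array([0, 0, 0, 0, 0, 0, 0, 1], dtype=np.uint8)
    wm   = np.array([0, 0, 0, 0, 0, 0, 0, 1], dtype=np.uint8)  # set LSB of matches
    print(ap_compare_write(mem, key, mask, wv, wm))
```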
{"title":"GainP: A Gain Cell Embedded DRAM-based associative in-memory processor","authors":"Yaniv Levi, Odem Harel, Adam Teman, Leonid Yavits","doi":"10.1016/j.sysarc.2025.103632","DOIUrl":"10.1016/j.sysarc.2025.103632","url":null,"abstract":"<div><div>Associative processors (APs) are massively-parallel in-memory SIMD accelerators. While fairly well-known, APs have been revisited in recent years due to the proliferation of data-centric computing, and specifically, processing using memory. APs are based on Content Addressable Memory and utilize its unique ability to simultaneously search the entire memory content for a query pattern to implement massively parallel computations in memory. Several memory infrastructures have been considered for associative processing, including static CMOS, resistive, magnetoresistive, ferroelectric and even NAND flash memories. While all of these have certain merits (speed and low energy consumption for static CMOS, density for resistive and ferroelectric memories), they also face challenges (low density for static CMOS and magnetoresistive, limited write endurance and high write energy for resistive and ferroelectric memories), which limit the scalability and usefulness of APs. This work introduces GainP, an AP based on silicon-proven Gain Cell embedded DRAM (GCeDRAM). The latter combines relatively high density (compared to static CMOS memory) with low energy, high speed, practically unlimited endurance and low production costs (compared to emerging memory technologies). Using sparse-by-sparse matrix multiplication, we show that GainP outperforms high-performance CPU and GPU by <span><math><mrow><mn>825</mn><mo>×</mo></mrow></math></span> and <span><math><mrow><mn>41</mn><mo>×</mo></mrow></math></span>. We also show that GainP outperforms state-of-the-art processing-in-memory sparse matrix multiplication accelerators GAS, OuterSPACE and MatRaptor by <span><math><mrow><mn>128</mn><mo>×</mo></mrow></math></span>, <span><math><mrow><mn>125</mn><mo>×</mo></mrow></math></span> and <span><math><mrow><mn>16</mn><mo>×</mo></mrow></math></span>, respectively, and provides average energy benefits of <span><math><mrow><mn>96</mn><mo>×</mo></mrow></math></span>, <span><math><mrow><mn>95</mn><mo>×</mo></mrow></math></span> and <span><math><mrow><mn>15</mn><mo>×</mo></mrow></math></span>, respectively.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103632"},"PeriodicalIF":4.1,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145841943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DeSpa: Heterogeneous multi-core accelerators for energy-efficient dense and sparse computation at the tile level in Deep Neural Networks
Pub Date: 2025-12-12 | DOI: 10.1016/j.sysarc.2025.103650
Hyungjun Jang, Dongho Ha, Hyunwuk Lee, Won Woo Ro
The rapid evolution of Deep Neural Networks (DNNs) has driven significant advances in Domain-Specific Accelerators (DSAs). However, efficiently exploiting DSAs across diverse workloads remains challenging because complementary techniques, from sparsity-aware computation to system-level innovations such as multi-core architectures, have progressed independently. Our analysis reveals pronounced tile-level sparsity variations within DNNs, which cause efficiency fluctuations on homogeneous accelerators built solely from dense or sparsity-oriented cores. To address this challenge, we present DeSpa, a novel heterogeneous multi-core accelerator architecture that integrates both dense and sparse cores to dynamically adapt to tile-level sparsity variations. DeSpa is paired with a heterogeneity-aware scheduler that employs a tile-stealing mechanism to maximize core utilization and minimize idle time. Compared to a homogeneous sparse multi-core baseline, DeSpa reduces energy consumption by 33% and improves energy-delay product (EDP) by 14%, albeit at the cost of a 35% latency increase. Relative to a homogeneous dense baseline, it reduces EDP by 44%, cuts energy consumption by 42%, and delivers a 1.34× speed-up.
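As a hedged sketch of the scheduling idea (not DeSpa's scheduler), the toy model below routes tiles to a sparse-core or dense-core queue based on a sparsity threshold and lets an idle core steal work from the other queue so neither core type sits idle. The cost model, threshold, and tile data are assumed values.

```python
from collections import deque

def schedule_tiles(tiles, sparse_threshold=0.5):
    """Toy model of heterogeneity-aware tile scheduling with tile stealing:
    tiles above the sparsity threshold go to the sparse-core queue, the rest
    to the dense-core queue; the next core to become free steals from the
    other queue whenever its own queue is empty."""
    sparse_q = deque(t for t in tiles if t["sparsity"] >= sparse_threshold)
    dense_q  = deque(t for t in tiles if t["sparsity"] <  sparse_threshold)
    clock = {"sparse_core": 0.0, "dense_core": 0.0}
    log = []
    while sparse_q or dense_q:
        core = min(clock, key=clock.get)                  # next core to become free
        own, other = (sparse_q, dense_q) if core == "sparse_core" else (dense_q, sparse_q)
        tile = own.popleft() if own else other.popleft()  # steal when own queue empty
        # crude cost model: the sparse core skips zero work, the dense core does not
        cost = tile["work"] * ((1 - tile["sparsity"]) if core == "sparse_core" else 1.0)
        clock[core] += cost
        log.append((core, tile["id"], round(clock[core], 2)))
    return log

if __name__ == "__main__":
    tiles = [{"id": i, "work": 10.0, "sparsity": s}
             for i, s in enumerate([0.1, 0.8, 0.3, 0.9, 0.6, 0.2])]
    for entry in schedule_tiles(tiles):
        print(entry)
```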
{"title":"DeSpa: Heterogeneous multi-core accelerators for energy-efficient dense and sparse computation at the tile level in Deep Neural Networks","authors":"Hyungjun Jang , Dongho Ha , Hyunwuk Lee , Won Woo Ro","doi":"10.1016/j.sysarc.2025.103650","DOIUrl":"10.1016/j.sysarc.2025.103650","url":null,"abstract":"<div><div>The rapid evolution of Deep Neural Networks (DNNs) has driven significant advances in Domain-Specific Accelerators (DSAs). However, efficiently exploiting DSAs across diverse workloads remains challenging because complementary techniques—from sparsity-aware computation to system-level innovations such as multi-core architectures—have progressed independently. Our analysis reveals pronounced tile-level sparsity variations within the DNNs, which cause efficiency fluctuations on homogeneous accelerators built solely from dense or sparsity-oriented cores. To address this challenge, we present DeSpa, a novel heterogeneous multi-core accelerator architecture that integrates both dense and sparse cores to dynamically adapt to tile-level sparsity variations. DeSpa is paired with a heterogeneity-aware scheduler that employs a tile-stealing mechanism to maximize core utilization and minimize idle time. Compared to a homogeneous sparse multi-core baseline, DeSpa reduces energy consumption by 33% and improves energy-delay product (EDP) by 14%, albeit at the cost of a 35% latency increase. Relative to a homogeneous dense baseline, it reduces EDP by 44%, cuts energy consumption by 42%, and delivers a <span><math><mrow><mn>1</mn><mo>.</mo><mn>34</mn><mo>×</mo></mrow></math></span> speed-up.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103650"},"PeriodicalIF":4.1,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145792174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dependency-aware microservices offloading in ICN-based edge computing testbed
Pub Date: 2025-12-08 | DOI: 10.1016/j.sysarc.2025.103663
Muhammad Nadeem Ali, Ihsan Ullah, Muhammad Imran, Muhammad Salah ud din, Byung-Seo Kim
Information-Centric Networking (ICN)-based edge computing has demonstrated remarkable potential in meeting the ultra-low latency and reliable communication required for offloading compute-intensive applications. Such applications are often composed of interdependent microservices that demand abundant communication and intensive computing resources. To avoid dependency conflicts, these microservices are typically arranged in a predefined sequence prior to offloading; however, this introduces waiting time for each microservice in the sequence. This paper presents an ICN-edge computing testbed framework to demonstrate the practical applicability of a study named IFCNS, which proposes a unique solution to reduce the offloading time of dependent microservices compared to an existing scheme named OTOOA. In the testbed, the IFCNS and OTOOA schemes are implemented on Raspberry Pi devices on top of the Named Data Networking (NDN) codebase, as Python scripts. Furthermore, this paper outlines the comprehensive testbed development procedure, including hardware and software configuration. To evaluate the effectiveness of the IFCNS scheme, modifications are applied to the NDN naming, microservice tracking functions, and forwarding strategy. The experimental results corroborate the effectiveness of IFCNS compared to OTOOA, demonstrating superior performance in time consumption, average interest satisfaction delay, energy consumption, FIB table load, and average naming overhead.
{"title":"Dependency-aware microservices offloading in ICN-based edge computing testbed","authors":"Muhammad Nadeem Ali , Ihsan Ullah , Muhammad Imran , Muhammad Salah ud din , Byung-Seo Kim","doi":"10.1016/j.sysarc.2025.103663","DOIUrl":"10.1016/j.sysarc.2025.103663","url":null,"abstract":"<div><div>Information-Centric Networking (ICN)-based edge computing has demonstrated remarkable potential in meeting ultra-low latency and reliable communication for offloading compute-intensive applications. Such applications are often composed of interdependent microservices that demand abundant communication and intensive computing resources. To avoid dependency conflict, these microservices are typically arranged in a predefined sequence prior to offloading; however, this introduces waiting time for each microservice in the sequence. This paper presents an ICN-edge computing-based testbed framework to demonstrate the practical applicability of a study named IFCNS, which proposes a unique solution to reduce the offloading time of dependent microservices compared to an existing scheme, named OTOOA. In the testbed, the IFCNS and OTOOA schemes are implemented on the Raspberry Pi devices, Named Data Network (NDN) codebase in a Python script. Furthermore, this paper outlined the comprehensive testbed development procedure, including hardware and software configuration. To evaluate the effectiveness of the IFCNS scheme, modifications are applied to the NDN naming, microservice tracking functions, and forwarding strategy. The experimental results corroborate the effectiveness of the IFCNS as compared to OTOOA, demonstrating superior performance in time consumption, average interest satisfaction delay, energy consumption, FIB table load, and average naming overhead.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"171 ","pages":"Article 103663"},"PeriodicalIF":4.1,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145748386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}