
Journal of Systems Architecture — Latest Publications

A load-balanced acceleration method for small and irregular batch matrix multiplication on GPU
IF 3.7 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-23 · DOI: 10.1016/j.sysarc.2025.103341
Yu Zhang , Lu Lu , Zhanyu Yang , Zhihong Liang , Siliang Suo
As an essential mathematical operation, GEneral Matrix Multiplication (GEMM) plays a vital role in many applications, such as high-performance computing and machine learning. In practice, the performance of GEMM is limited by the matrix dimensions and the diversity of GPU hardware architectures; when dealing with batched, irregular, and small matrices, GEMM efficiency is usually poor. A common remedy is to segment each matrix into multiple tiles and exploit parallelism between workgroups on the GPU to compute the results. However, previous works consider only tile size and inter-workgroup parallelism, ignoring the low computational efficiency and poor hardware-resource utilization caused by workload differences between wavefronts. To address these issues, we propose a load-balanced batch GEMM acceleration method consisting of a multi-thread kernel design and an efficient tiling algorithm. The multi-thread kernel design addresses workload imbalance between wavefronts in different workgroups, and the efficient tiling algorithm chooses the optimal tiling scheme with a new thread-level parallelism calculation method to achieve load-balanced task allocation. Finally, comparative experiments were conducted on two GPU platforms, AMD and NVIDIA; the results indicate that the proposed method outperforms previous methods.
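The tile-then-balance idea in this abstract can be sketched in a few lines. The sketch below is an illustrative Python model only, not the paper's kernel or its parallelism formula; `pick_tiling`, the candidate tile list, and the workgroup count are hypothetical names and parameters.

```python
# Illustrative model: cover each batched output matrix with tiles, then
# choose the candidate tile size whose tiles spread most evenly across a
# fixed number of GPU workgroups.
import math

def num_tiles(m, n, tile_m, tile_n):
    """Tiles needed to cover one M x N output matrix."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)

def load_imbalance(batch, tile_m, tile_n, workgroups):
    """Ratio of the busiest workgroup's tile count to the average (1.0 = balanced)."""
    tiles = sum(num_tiles(m, n, tile_m, tile_n) for (m, n, _k) in batch)
    per_wg = [tiles // workgroups + (1 if i < tiles % workgroups else 0)
              for i in range(workgroups)]
    return max(per_wg) / (tiles / workgroups)

def pick_tiling(batch, candidates, workgroups):
    """Return the candidate (tile_m, tile_n) with the lowest imbalance."""
    return min(candidates, key=lambda t: load_imbalance(batch, t[0], t[1], workgroups))
```

A real implementation would also weigh register and shared-memory pressure per tile, which this toy model ignores.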
Citations: 0
A hash-based post-quantum ring signature scheme for the Internet of Vehicles
IF 3.7 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-23 · DOI: 10.1016/j.sysarc.2025.103345
Shuanggen Liu , Xiayi Zhou , Xu An Wang , Zixuan Yan , He Yan , Yurui Cao
With the rapid development of the Internet of Vehicles, securing data transmission has become crucial, especially given the threat posed by quantum computing to traditional digital signatures. This paper presents a hash-based post-quantum ring signature scheme built upon the XMSS hash-based signature framework, leveraging Merkle trees for efficient data organization and verification. In addition, the scheme is applied to the Internet of Vehicles, ensuring both anonymity and traceability while providing robust quantum-resistant security. Evaluation results indicate that, compared to other schemes, the proposed method achieves superior verification speed while ensuring data security and privacy.
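Hash-based schemes such as XMSS, which this abstract builds on, rest on Merkle-tree authentication paths. The sketch below shows only that generic building block (root construction and path verification); it is not the paper's ring-signature construction, and the function names are illustrative.

```python
# Generic Merkle-tree sketch: compute a root over hashed leaves, extract
# the sibling path for one leaf, and verify the leaf against the root.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(x) for x in leaves]          # leaf count assumed a power of two
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def auth_path(leaves, index):
    """Sibling hashes from leaf `index` up to the root."""
    level = [h(x) for x in leaves]
    path = []
    while len(level) > 1:
        path.append(level[index ^ 1])       # sibling at this level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return path

def verify(leaf, index, path, root):
    node = h(leaf)
    for sib in path:
        node = h(node + sib) if index % 2 == 0 else h(sib + node)
        index //= 2
    return node == root
```

In XMSS-style signatures the leaves are one-time public keys; verification recomputes the root from the leaf and the transmitted path, exactly as above.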
Citations: 0
Component-based architectural regression test selection for modularized software systems
IF 3.7 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-18 · DOI: 10.1016/j.sysarc.2025.103343
Mohammed Al-Refai , Mahmoud M. Hammad
Regression testing is an essential part of software development, but it can be costly and require significant computational resources. Regression Test Selection (RTS) improves regression-testing efficiency by re-executing only the tests affected by code changes. Recently, dynamic and static RTS techniques for Java projects showed that selecting tests at a coarser granularity (class level) is more effective than selecting tests at a finer granularity (method or statement level). However, prior techniques mainly target object-oriented Java projects rather than modularized ones: given the explicit architectural constructs introduced by the Java Platform Module System (JPMS) in Java 9, these research efforts are not customized for component-based Java projects. To that end, we propose two static component-based RTS approaches, CORTS and its variant C2RTS, tailored for component-based Java software systems. CORTS leverages architectural information such as components and ports, specified in the module descriptor files, to construct a module-level dependency graph and identify relevant tests. The variant, C2RTS, is a hybrid approach that integrates analysis at both the module and class levels, employing module descriptor files and compile-time information to construct the dependency graph and identify relevant tests.
We evaluated CORTS and C2RTS on 1200 revisions of 12 real-world open-source software systems and compared the results with those of class-level dynamic (Ekstazi) and static (STARTS) RTS approaches. CORTS and C2RTS outperformed static class-level RTS in terms of safety violation, which measures the extent to which an RTS technique misses test cases that should be selected. Using Ekstazi as the baseline, the average safety violation was 1.14% for CORTS, 2.21% for C2RTS, and 3.19% for STARTS. On the other hand, CORTS and C2RTS selected more test cases than Ekstazi and STARTS: the average reduction in test-suite size was 22.78% for CORTS and 43.47% for C2RTS, compared to 68.48% for STARTS and 84.21% for Ekstazi. For all studied subjects, CORTS and C2RTS reduced the size of the static dependency graphs compared to those generated by static class-level RTS, leading to faster graph construction and analysis for test-case selection. Additionally, CORTS and C2RTS reduced overall end-to-end regression-testing time compared to the retest-all strategy.
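The common core of graph-based RTS, selecting every test whose module transitively depends on a changed module, can be sketched as a reachability query. This is a hedged illustration of the general technique, not CORTS's actual analysis; the data shapes and `select_tests` name are assumptions.

```python
# Generic graph-based RTS sketch: a test is selected iff its module can
# transitively reach a changed module through the dependency graph.
def select_tests(deps, tests, changed):
    """deps: module -> set of modules it depends on.
    tests: test name -> module it belongs to.
    changed: set of modules touched by the code change."""
    memo = {}

    def reaches(mod):
        if mod in memo:
            return memo[mod]
        memo[mod] = False  # provisional value guards against dependency cycles
        memo[mod] = mod in changed or any(reaches(d) for d in deps.get(mod, ()))
        return memo[mod]

    return {t for t, mod in tests.items() if reaches(mod)}
```

A module-level graph (as in CORTS) is much smaller than a class-level one, which is where the reported speedups in graph construction come from.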
Citations: 0
An efficient string solver for string constraints with regex-counting and string-length
IF 3.7 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-17 · DOI: 10.1016/j.sysarc.2025.103340
Denghang Hu, Zhilin Wu
Regular expressions (regexes for short) and the string-length function are widely used in string-manipulating programs. Counting is a frequently used regex feature that bounds the number of matchings of a sub-pattern. State-of-the-art string solvers cannot solve string constraints with regex-counting and string-length efficiently, especially when the counting and length bounds are large. In this work, we propose an automata-theoretic approach for solving this class of string constraints. The main idea is to model the counting operators symbolically with registers in automata instead of unfolding them explicitly, thus alleviating the state-explosion problem; the string-length function is modeled by a register as well. As a result, the satisfiability of string constraints with regex-counting and string-length is reduced to the satisfiability of linear integer arithmetic, which off-the-shelf SMT solvers can then decide. To further improve performance, we also propose various optimization techniques. We implemented the algorithms and validated our approach on 49,843 benchmark instances.
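The register idea can be seen on the smallest possible example: checking a^{lo,hi} with one integer counter instead of unfolding the pattern into up to `hi` automaton copies. This toy matcher is an illustration of the principle, not the paper's solver; the final check `lo <= count <= hi` is exactly the kind of linear-arithmetic constraint the approach hands to an SMT solver.

```python
# Toy counting matcher: one counter register replaces explicit unfolding
# of the pattern ch{lo,hi}, so the state space stays constant regardless
# of how large the bounds are.
def match_counted(s, ch, lo, hi):
    """Does s consist of `ch` repeated between lo and hi times?"""
    count = 0                 # the 'register'
    for c in s:
        if c != ch:
            return False
        count += 1
        if count > hi:        # upper bound exceeded: reject early
            return False
    return lo <= count <= hi  # a linear integer arithmetic constraint
```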
Citations: 0
LE-GEMM: A lightweight emulation-based GEMM with precision refinement on GPU
IF 3.7 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-17 · DOI: 10.1016/j.sysarc.2025.103336
Yu Zhang , Lu Lu , Zhanyu Yang , Zhihong Liang , Siliang Suo
Many specialized hardware units, such as Matrix Cores and Tensor Cores, have recently been designed and applied in various scientific-computing scenarios. These units support tensor-level computation at different precisions on GPUs. Previous studies have proposed methods for computing single-precision GEneral Matrix Multiplication (GEMM) with half-precision matrices; however, this routine often loses accuracy, which limits its application. This paper proposes a Lightweight Emulation-based GEMM (LE-GEMM) on GPU that includes a lightweight emulation algorithm, a thread-parallelism analytic model, and an efficient multi-level pipeline implementation to accelerate computation without compromising accuracy requirements. First, the lightweight emulation algorithm combines a precision-transformation process with GEMM emulation calculation to achieve better computational accuracy and performance. Second, the thread-parallelism analytic model analyzes and guides the selection of the optimal tiling scheme for various computing scenarios and hardware. Third, the efficient multi-level pipeline maximizes instruction-level parallelism and latency hiding. Comparison experiments were conducted on two commonly used GPU platforms, AMD and NVIDIA. The experimental results show that the proposed method outperforms previous approaches in terms of computational accuracy and speed.
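A well-known generic form of such precision refinement (assumed here for illustration; the abstract does not spell out LE-GEMM's exact transformation) splits each fp32 operand into an fp16 high part plus an fp16 residual, then combines three half-precision products, recovering most of the accuracy lost by a single fp16 multiply.

```python
# Generic split-and-recombine emulation of fp32 GEMM with fp16 operands,
# accumulating in fp32 as tensor-core-style units do. Not LE-GEMM itself.
import numpy as np

def split_fp16(a):
    """Split an fp32 matrix into an fp16 high part and an fp16 residual."""
    hi = a.astype(np.float16)
    lo = (a - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

def emulated_gemm(a, b):
    a_hi, a_lo = split_fp16(a)
    b_hi, b_lo = split_fp16(b)
    f32 = np.float32
    # The tiny a_lo @ b_lo cross term is dropped, as is common in practice.
    return (a_hi.astype(f32) @ b_hi.astype(f32)
            + a_hi.astype(f32) @ b_lo.astype(f32)
            + a_lo.astype(f32) @ b_hi.astype(f32))
```

Because the residual carries roughly 11 extra mantissa bits, the combined result is far closer to true fp32 GEMM than a naive fp16 multiply.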
Citations: 0
A CP-ABE-based access control scheme with cryptographic reverse firewall for IoV
IF 3.7 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-17 · DOI: 10.1016/j.sysarc.2025.103331
Xiaodong Yang , Xilai Luo , Zefan Liao , Wenjia Wang , Xiaoni Du , Shudong Li
The convergence of AI and internet technologies has sparked significant interest in the Internet of Vehicles (IoV) and intelligent transportation systems (ITS). However, the vast amount of data generated within these systems poses challenges for onboard terminals and secure data sharing. To address these issues, we propose a novel solution for IoV that combines ciphertext-policy attribute-based encryption (CP-ABE) with a cryptographic reverse firewall (CRF) mechanism. This approach offers several advantages, including offline encryption and outsourced decryption to improve efficiency. The CRF mechanism adds an extra layer of security by re-randomizing vehicle data, protecting sensitive information. While single-attribute-authority schemes simplify access control, they are not ideal for IoV environments; we therefore introduce a multi-authority scheme to enhance security. Performance analysis demonstrates that our scheme optimizes encryption and decryption while safeguarding the confidentiality of vehicle data. In summary, our solution improves data management, access control, and security in the IoV, contributing to its safe and efficient development.
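The reverse-firewall idea, refreshing a ciphertext's randomness without knowing the plaintext or any secret key, can be illustrated with textbook ElGamal re-randomization. This is a toy with deliberately tiny, insecure parameters and is not the paper's CP-ABE construction; all names and numbers here are for illustration only.

```python
# Toy ElGamal over Z_p* showing ciphertext re-randomization: the firewall
# multiplies in a fresh encryption of 1, so decryption is unchanged but
# the ciphertext's randomness is replaced.
import random

P, G = 467, 2  # tiny demo prime and generator: insecure, illustration only

def keygen():
    sk = random.randrange(2, P - 1)
    return sk, pow(G, sk, P)

def encrypt(pk, m):
    r = random.randrange(1, P - 1)
    return (pow(G, r, P), m * pow(pk, r, P) % P)

def rerandomize(pk, ct):
    """The reverse firewall's job: refresh randomness with only the public key."""
    c1, c2 = ct
    s = random.randrange(1, P - 1)
    return (c1 * pow(G, s, P) % P, c2 * pow(pk, s, P) % P)

def decrypt(sk, ct):
    c1, c2 = ct
    return c2 * pow(c1, P - 1 - sk, P) % P  # c1^(-sk) via Fermat's little theorem
```

The same principle carries over to CRFs for richer schemes: any randomness the sender's (possibly compromised) device chose is overwritten in transit.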
Citations: 0
REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems
IF 3.7 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-09 · DOI: 10.1016/j.sysarc.2025.103339
Gun Ko, Jiwon Lee, Hongju Kal, Hyunwuk Lee, Won Woo Ro
With the increasing demands of modern workloads, multi-GPU systems have emerged as a scalable solution, extending performance beyond the capabilities of single GPUs. However, these systems face significant challenges in managing memory across multiple GPUs, particularly due to the Non-Uniform Memory Access (NUMA) effect, which introduces latency penalties when accessing remote memory. To mitigate NUMA overheads, GPUs typically cache remote memory accesses across multiple levels of the cache hierarchy, kept coherent by cache coherence protocols. The traditional GPU bulk-synchronous programming (BSP) model relies on coarse-grained invalidations and cache flushes at kernel boundaries, which are insufficient for the fine-grained communication patterns required by emerging applications. In multi-GPU systems, where NUMA is a major bottleneck, the substantial data movement resulting from bulk cache invalidations exacerbates performance overheads. Recent cache coherence protocols for multi-GPUs enable flexible data sharing through coherence directories that track shared data at a fine granularity across GPUs. However, these directories are limited in capacity, leading to frequent evictions and unnecessary invalidations, which increase cache misses and degrade performance. To address these challenges, we propose REC, a low-cost architectural solution that enhances the effective tracking capacity of coherence directories by leveraging memory access locality. REC coalesces multiple tag addresses from remote read requests within common address ranges, reducing directory storage overhead while maintaining fine-grained coherence for writes. Our evaluation on a 4-GPU system shows that REC reduces L2 cache misses by 53.5% and improves overall system performance by 32.7% across a variety of GPU workloads.
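The coalescing idea can be modeled in a few lines: tags that fall into a common aligned region share one directory entry, so a fixed number of entries tracks more cache lines. This is a hedged sketch of the general technique, not REC's actual hardware structure; `coalesce` and the region size are assumptions.

```python
# Sketch of range coalescing for a coherence directory: line addresses in
# the same aligned region collapse into one entry keyed by the region base,
# with a small offset set recording which lines inside it are tracked.
def coalesce(tags, region_lines=8):
    """Group line addresses into aligned regions of `region_lines` lines.
    One dict entry replaces up to `region_lines` individual entries."""
    directory = {}
    for tag in tags:
        base = tag - (tag % region_lines)
        directory.setdefault(base, set()).add(tag - base)
    return directory
```

The payoff depends on spatial locality: the more neighboring lines a GPU reads remotely, the fewer entries the directory needs, which is exactly the locality REC exploits.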
Citations: 0
ChipAI: A scalable chiplet-based accelerator for efficient DNN inference using silicon photonics
IF 3.7 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-11-26 · DOI: 10.1016/j.sysarc.2024.103308
Hao Zhang , Haibo Zhang , Zhiyi Huang , Yawen Chen
To enhance the precision of inference, deep neural network (DNN) models have been progressively growing in scale and complexity, leading to increased latency and computational resource demands. This growth necessitates scalable architectures, such as chiplet-based accelerators, to accommodate the substantial volume of deep learning inference tasks. However, the efficiency, energy consumption, and scalability of existing accelerators are severely constrained by metallic interconnects. Photonic interconnects, on the contrary, offer a promising alternative, with their advantages of low latency, high bandwidth, high energy efficiency, and simplified communication processes. In this paper, we propose ChipAI, an accelerator designed based on photonic interconnects for accelerating DNN inference tasks. ChipAI implements an efficient hybrid optical network that supports effective inter-chiplet and intra-chiplet data sharing, thereby enhancing parallel processing capabilities. Additionally, we propose a flexible dataflow leveraging the ChipAI architecture and the characteristics of DNN models, facilitating efficient architectural mapping of DNN layers. Simulation on various DNN models demonstrates that, compared to the state-of-the-art chiplet-based DNN accelerator with photonic interconnects, ChipAI can reduce the DNN inference time and energy consumption by up to 82% and 79%, respectively.
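One basic ingredient of mapping DNN layers onto chiplets is splitting a layer's work, for example its output channels, into near-equal slices, one per chiplet. The sketch below is a hypothetical illustration of that partitioning step only (the abstract does not describe ChipAI's actual dataflow); `partition_channels` is an assumed name.

```python
# Illustrative layer-to-chiplet partitioning: divide `channels` output
# channels into contiguous, near-equal slices so per-chiplet compute is
# balanced; shared input activations would be broadcast over the network.
def partition_channels(channels, chiplets):
    """Return (start, end) output-channel slices, one per chiplet."""
    base, extra = divmod(channels, chiplets)
    slices, start = [], 0
    for i in range(chiplets):
        end = start + base + (1 if i < extra else 0)
        slices.append((start, end))
        start = end
    return slices
```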
Citations: 0
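The "architectural mapping of DNN layers" step mentioned in the abstract — deciding how one layer's work is partitioned across chiplets — can be illustrated with a toy load-balancing calculation. This is a sketch only: the layer shape, MAC counts, and chiplet count are invented for the example, and none of this code comes from the paper.

```python
# Toy illustration of mapping one DNN layer onto chiplets by splitting
# its output channels as evenly as possible (compute partitioning only;
# a real mapper would also model inter-chiplet communication costs).

def map_layer_to_chiplets(out_channels, macs_per_channel, num_chiplets):
    """Return the per-chiplet MAC workload for an even channel split."""
    base, extra = divmod(out_channels, num_chiplets)
    shares = [base + (1 if i < extra else 0) for i in range(num_chiplets)]
    return [s * macs_per_channel for s in shares]

# Example: a layer with 130 output channels mapped onto 4 chiplets.
loads = map_layer_to_chiplets(out_channels=130, macs_per_channel=1_000,
                              num_chiplets=4)
print(loads)  # per-chiplet MAC counts: [33000, 33000, 32000, 32000]
print(max(loads) / (sum(loads) / len(loads)))  # load-imbalance factor
```

A real dataflow, as the paper describes, would additionally exploit the optical network for inter-chiplet and intra-chiplet data sharing; this sketch shows only the compute-partitioning arithmetic.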
Non-interactive set intersection for privacy-preserving contact tracing
IF 3.7 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-11-24 · DOI: 10.1016/j.sysarc.2024.103307
Axin Wu, Yuer Yang, Jinghang Wen, Yu Zhang, Qiuxia Zhao
Contact tracing (CT) is an effective method to combat the spread of infectious diseases like COVID-19 by notifying and alerting individuals who have been in contact with infected patients. One simple yet practical approach to implementing CT functionality is to directly publish the travel history and locations visited by infected users. However, this approach compromises location privacy and makes infected individuals reluctant to participate in such systems. Private set intersection (PSI) is a promising candidate for protecting the privacy of participants. However, interactive PSI protocols can be unfriendly to querists with limited resources because of their high local computation costs and communication-bandwidth requirements. Additionally, concerns about identity leakage may lead infected users to omit or misreport the locations they visited. To address these issues, we propose a cloud-assisted non-interactive framework for privacy-preserving CT, which enables querists to obtain query results without multi-round interaction and addresses concerns regarding location and identity information leakage. Its core building block is a cloud-assisted non-interactive set-intersection protocol, skillfully transformed from anonymous broadcast encryption (AnoBE). To our knowledge, this is the first derivation from AnoBE. We also instantiate the proposed framework and thoroughly evaluate its performance, demonstrating its efficiency.
Axin Wu, Yuer Yang, Jinghang Wen, Yu Zhang, Qiuxia Zhao, "Non-interactive set intersection for privacy-preserving contact tracing," Journal of Systems Architecture, vol. 158, Article 103307 (published 2024-11-24). DOI: 10.1016/j.sysarc.2024.103307
Citations: 0
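For contrast with the paper's non-interactive, AnoBE-based construction, the underlying set-intersection idea over visited locations can be sketched with a naive salted-hash baseline. This is illustrative only: plain salted hashing is vulnerable to brute force over a small location universe and provides none of the guarantees of a real PSI protocol, and all names and the salt scheme below are invented for the example.

```python
import hashlib

def h(salt: bytes, location: str) -> str:
    """Salted hash of a visited location (illustration only; salted
    hashing alone is NOT a secure PSI protocol)."""
    return hashlib.sha256(salt + location.encode()).hexdigest()

def hashed_set(salt: bytes, locations: list) -> set:
    return {h(salt, loc) for loc in locations}

# Shared public salt for the current epoch (assumption for the example).
SALT = b"epoch-2024-11"

# Published by the health authority: hashed visits of infected users.
infected_visits = hashed_set(SALT, ["cafe-12", "gym-3", "station-A"])

# Computed locally by a querist over their own visit history.
querist_visits = hashed_set(SALT, ["station-A", "office-7"])

# The querist learns only which of their own locations overlap.
overlap = querist_visits & infected_visits
print(len(overlap))  # prints 1 (the "station-A" contact)
```

The paper's scheme avoids both the brute-force weakness and the need for interaction by building the intersection from anonymous broadcast encryption; this baseline only shows the set-intersection goal the protocol achieves privately.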
A Survey of Edge Caching Security: Framework, Methods, and Challenges
IF 3.7 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-11-24 · DOI: 10.1016/j.sysarc.2024.103306
Hang Zhang, Jinsong Wang, Zening Zhao, Zhao Zhao
Edge caching reduces frequent communication between users and remote cloud servers by caching popular content at the network edge, which decreases response latency and improves the user service experience. However, the openness and vulnerability of edge caching introduce several security risks. Existing research on edge caching security focuses only on certain specific aspects and does not consider edge caching security from a global perspective. This paper therefore provides a comprehensive review of edge caching security in order to accelerate the development of related research areas. Specifically, we first introduce the traditional and extended models of edge caching, the threats to edge caching, and the key metrics for implementing edge caching security. We then propose a comprehensive security framework for edge caching that considers content request security, content transmission security, content caching security, and multi-party trusted collaboration. Each of the framework's four aspects is discussed in detail, with the aim of achieving security protection for edge caching. Finally, we discuss the shortcomings of current edge caching security and potential future directions.
Hang Zhang, Jinsong Wang, Zening Zhao, Zhao Zhao, "A Survey of Edge Caching Security: Framework, Methods, and Challenges," Journal of Systems Architecture, vol. 158, Article 103306 (published 2024-11-24). DOI: 10.1016/j.sysarc.2024.103306
Citations: 0
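One concrete instance of the framework's "content caching security" aspect is verifying the integrity of cached content before serving it. A minimal sketch follows, with hypothetical class and method names that are not taken from the survey:

```python
import hashlib

class EdgeCache:
    """Toy edge cache that stores content alongside its SHA-256 digest
    and refuses to serve tampered entries (illustrative sketch only)."""

    def __init__(self):
        self._store = {}  # key -> (content, hex digest)

    def put(self, key, content):
        self._store[key] = (content, hashlib.sha256(content).hexdigest())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss: fetch from origin
        content, digest = entry
        if hashlib.sha256(content).hexdigest() != digest:
            del self._store[key]  # tampered entry: evict, force re-fetch
            return None
        return content

cache = EdgeCache()
cache.put("video/intro", b"popular content")
print(cache.get("video/intro"))  # prints b'popular content'

# Simulate tampering with the cached bytes (digest no longer matches).
cache._store["video/intro"] = (b"evil content",
                               cache._store["video/intro"][1])
print(cache.get("video/intro"))  # prints None (entry evicted)
```

A deployed system would obtain the reference digest from a signed manifest or the origin server rather than computing it locally at insertion time; the sketch only shows the verify-before-serve pattern.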