
ACM Transactions on Embedded Computing Systems: Latest Publications

Transient Fault Detection in Tensor Cores for Modern GPUs
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-10 | DOI: 10.1145/3687483
M. Hafezan, E. Atoofian
Deep Neural Networks (DNNs) have emerged as an effective solution for many machine learning applications. However, this success comes at the cost of excessive computation. The Volta graphics processing unit (GPU) from NVIDIA introduced a specialized hardware unit called the tensor core (TC), aimed at meeting the growing computational demands of DNNs. Most previous studies on TCs have focused on improving performance by exploiting the TC's high degree of parallelism. However, as DNNs are deployed in safety-critical applications such as autonomous driving, the reliability of TCs is as important as their performance. In this work, we exploit the unique architectural characteristics of TCs and propose a simple, implementation-efficient hardware technique called fault detection in tensor cores (FDTC) to detect transient faults in TCs. In particular, FDTC exploits the zero-valued weights that stem from network pruning, as well as the sparse activations arising from the common ReLU operator, to verify tensor operations. A high level of sparsity in tensors allows FDTC to run original and verifying products simultaneously, leading to zero performance penalty. For applications with a low sparsity rate, FDTC relies on temporal redundancy to re-execute effectual products, scheduling the verifying products only when multipliers are idle. Our experimental results reveal that FDTC offers 100% fault coverage with no performance penalty and a small energy overhead in TCs.
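The mechanism FDTC relies on, hiding duplicate "verifying" products in the multiplier slots freed by zero-valued operands, can be illustrated with a small toy model. The sketch below is an assumption-laden Python illustration, not the paper's hardware design: the cycle accounting, the fault-injection knob `flip_prob`, and the function names are all invented here.

```python
import random

def dot_with_verification(weights, activations, flip_prob=0.0):
    """Toy cycle model: compute a dot product while re-executing every
    effectual product a second time, using multiplier slots freed by
    zero operands when possible and extra idle cycles otherwise."""
    assert len(weights) == len(activations)
    acc, pending, extra_cycles, mismatches = 0, [], 0, 0

    def multiply(a, b):
        p = a * b
        if random.random() < flip_prob:   # assumed transient-fault model
            p ^= 1
        return p

    for w, x in zip(weights, activations):
        if w == 0 or x == 0:
            # Ineffectual product: the multiplier is idle this cycle, so
            # spend it verifying a previously deferred effectual product.
            if pending:
                a, b, first = pending.pop()
                mismatches += int(multiply(a, b) != first)
            continue
        first = multiply(w, x)            # original (effectual) product
        acc += first
        pending.append((w, x, first))     # remember it for verification

    # Low-sparsity case: temporal redundancy, re-execute in extra cycles.
    for a, b, first in pending:
        extra_cycles += 1
        mismatches += int(multiply(a, b) != first)

    return acc, extra_cycles, mismatches

if __name__ == "__main__":
    w = [0, 3, 0, 1, 2, 0, 0, 5]   # pruned weights (many zeros)
    x = [4, 0, 7, 2, 1, 0, 3, 1]   # post-ReLU activations (some zeros)
    print(dot_with_verification(w, x))   # -> (9, 1, 0)
```

With enough zeros, every deferred product is verified "for free" in an otherwise idle slot; only the leftover products cost extra cycles, which matches the zero-penalty claim for highly sparse tensors.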
Citations: 0
Optimizing Dilithium Implementation with AVX2/-512
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-10 | DOI: 10.1145/3687309
Runqing Xu, Debiao He, Min Luo, Cong Peng, Xiangyong Zeng
Dilithium is a signature scheme that is currently being standardized by NIST as the Module-Lattice-Based Digital Signature Standard. Because its security rests on lattice problems, it is believed to be secure even against attacks from large-scale quantum computers. Implementation efficiency is important for promoting the migration from current cryptographic algorithms to post-quantum algorithms. In this paper, we optimize the implementation of Dilithium with several newly proposed approaches. First, we improve the efficiency of parallel NTT implementations: the overhead of shuffling operations is reduced, and fewer load instructions are invoked for the precomputations. Then, we optimize the sampling and bit-packing of polynomial coefficients in Dilithium. Using a new approach to sampling the secret-key polynomials, we can handle double the number of coefficients within one register. The approaches proposed in this paper are applicable to implementations under both the AVX2 and AVX-512 instruction sets. Taking Dilithium2 as an illustration, our AVX2 implementation demonstrates improvements of 22.7%, 16.9%, and 13.5% for KeyGen, Sign, and Verify, respectively, compared to the previous implementation.
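As background for the sampling step the authors vectorize, the sketch below shows plain (scalar) rejection sampling of uniform polynomial coefficients modulo the Dilithium prime q = 8380417 from a SHAKE-128 stream, three bytes per candidate. It is a baseline illustration, not the paper's AVX2/AVX-512 code; the buffer sizing and helper names are assumptions.

```python
import hashlib

Q = 8380417   # Dilithium modulus q = 2^23 - 2^13 + 1
N = 256       # polynomial degree

def sample_uniform_poly(seed: bytes, nonce: int):
    """Rejection-sample N coefficients uniform in [0, Q): read 3 bytes,
    mask to 23 bits, keep the candidate only if it is below Q."""
    xof = hashlib.shake_128(seed + nonce.to_bytes(2, "little"))
    needed = 3 * N + 168                  # head-room for the rare rejections
    stream = xof.digest(needed)
    coeffs, i = [], 0
    while len(coeffs) < N:
        if i + 3 > len(stream):           # ran out: re-read a longer prefix
            needed *= 2
            stream = xof.digest(needed)
        candidate = int.from_bytes(stream[i:i + 3], "little") & 0x7FFFFF
        i += 3
        if candidate < Q:                 # rejection keeps the distribution uniform
            coeffs.append(candidate)
    return coeffs

if __name__ == "__main__":
    poly = sample_uniform_poly(b"\x00" * 32, 0)
    print(len(poly), "coefficients, all below q:", max(poly) < Q)
```

The paper's speedups come from vectorizing such sampling and bit-packing loops under AVX2/AVX-512, including a secret-key sampling approach that handles twice as many coefficients per register.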
Citations: 0
High Performance and Predictable Shared Last-level Cache for Safety-Critical Systems
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-08 | DOI: 10.1145/3687308
Zhuanhao Wu, A. Kaushik, Hiren D. Patel
We propose ZeroCost-LLC (ZCLLC), a novel shared inclusive last-level cache (LLC) design for timing-predictable multi-core platforms that offers a lower worst-case latency (WCL) than a traditional shared inclusive LLC design. ZCLLC achieves a low WCL by eliminating certain memory operations, namely the cache line invalidations propagated across the cache hierarchy when a core's memory request misses in the entire hierarchy and the LLC has no vacant entry to accommodate the fetched data. In addition to a low WCL, ZCLLC offers performance benefits in the form of additional caching capacity, and unlike state-of-the-art approaches, it does not impose any constraints on its usage across multiple cores. In this work, we describe the impact of LLC cache line invalidations on the WCL and systematically build solutions that eliminate these invalidations, resulting in ZCLLC. We also present an optimized variant of ZCLLC that offers a lower WCL and improved average-case performance: we optimize the shared bus arbitration mechanism and extend the micro-architecture to allow overlapping memory requests to main memory. Our analysis reveals that the analytical WCL of a memory request under ZCLLC is 87.0%, 93.8%, and 97.1% lower than under state-of-the-art LLC partition sharing techniques for 2, 4, and 8 cores, respectively. ZCLLC shows average-case performance speedups of 1.89×, 3.36×, and 6.24× over the state-of-the-art LLC partition sharing techniques for 2, 4, and 8 cores, respectively. Compared to the original, unoptimized design (ZCLLC-NORMAL), the optimized variant lowers the analytical WCL by a further 76.5%, 82.6%, and 86.2% for 2, 4, and 8 cores, respectively.
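The invalidation traffic that inflates the worst-case latency of an inclusive hierarchy, and that ZCLLC is built to eliminate, can be seen in a toy two-core model: when the shared LLC evicts a line that a private L1 still holds, inclusion forces a back-invalidation of that private copy. The sketch below is a conceptual illustration with made-up cache sizes and plain LRU replacement, not the paper's micro-architecture.

```python
from collections import OrderedDict

class InclusiveHierarchy:
    """Toy model: per-core private L1 caches plus a shared inclusive LLC (all LRU)."""
    def __init__(self, cores=2, l1_lines=2, llc_lines=4):
        self.l1 = [OrderedDict() for _ in range(cores)]
        self.llc = OrderedDict()
        self.l1_lines, self.llc_lines = l1_lines, llc_lines
        self.back_invalidations = 0

    def access(self, core, addr):
        if addr in self.l1[core]:
            self.l1[core].move_to_end(addr)
            self.llc.move_to_end(addr)
            return "L1 hit"
        result = "LLC hit" if addr in self.llc else "miss"
        if addr not in self.llc:
            if len(self.llc) == self.llc_lines:
                victim, _ = self.llc.popitem(last=False)     # evict the LLC LRU line
                for private in self.l1:
                    if victim in private:
                        # Inclusion forces the still-live private copy out too.
                        # This back-invalidation is the extra worst-case work
                        # the abstract says ZCLLC eliminates.
                        del private[victim]
                        self.back_invalidations += 1
            self.llc[addr] = True
        self.llc.move_to_end(addr)
        if len(self.l1[core]) == self.l1_lines:
            self.l1[core].popitem(last=False)                # local L1 eviction
        self.l1[core][addr] = True
        return result

if __name__ == "__main__":
    h = InclusiveHierarchy()
    # Core 0 keeps lines 0 and 1 hot in its L1 while core 1 streams new lines
    # through the small shared LLC, evicting core 0's lines from the LLC.
    for core, addr in [(0, 0), (0, 1), (1, 2), (1, 3), (1, 4), (1, 5), (0, 0)]:
        h.access(core, addr)
    print("back-invalidations:", h.back_invalidations)       # -> 2
```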
Citations: 0
APB-tree: An Adaptive Pre-built Tree Indexing Scheme for NVM-based IoT Systems
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-26 | DOI: 10.1145/3677179
Shih-Wen Hsu, Yen-Ting Chen, Kam-yiu Lam, Yuan-Hao Chang, W. Shih, Han-Chieh Chao
With the proliferation of sensors and the emergence of novel applications, IoT data has grown exponentially in recent years. Given this trend, efficient data management is crucial for a system to easily access vast amounts of information. For decades, B+-tree-based indexing schemes have been widely adopted to provide effective search in IoT systems. However, in systems with pre-distributed sensors, B+-tree-based indexes fail to optimally utilize the known IoT data distribution, leading to significant write overhead and energy consumption. Furthermore, as non-volatile memory (NVM) technology emerges as an alternative storage medium, the inherent write asymmetry of NVM leads to instability issues in IoT systems, especially for write-intensive applications. In this research, by considering the write overheads of tree-based indexing schemes and the key-range distribution assumption, we rethink the design of tree-based indexing and propose an adaptive pre-built tree (APB-tree) indexing scheme to reduce the write overhead of serving key insertions and deletions in NVM-based IoT systems. The APB-tree profiles the hot region of the key distribution from the known key range to pre-allocate the index structure, which alleviates online index-management costs and run-time index overhead. Meanwhile, the APB-tree retains the scalability of a tree-based index structure to accommodate the large amount of new data that additional nodes bring into the IoT system. Extensive experiments demonstrate that our solution achieves significant performance improvements for write operations while maintaining effective energy consumption in NVM-based IoT systems. We compare the energy and time required for basic key operations such as Put(), Get(), and Delete() in APB-trees and B+-tree-based indexing schemes. Under workloads with varying ratios of these operations, the proposed design reduces execution time by 47% to 72% and energy consumption by 11% to 72% compared to B+-tree-based indexing schemes.
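The pre-building idea, allocating index structure ahead of time from the known key range and giving the profiled hot region finer-grained leaves so that skewed inserts rarely trigger splits, can be sketched as follows. The leaf geometry, the capacities, and the use of a split counter as a stand-in for extra NVM writes are assumptions made for illustration, not the authors' data layout.

```python
import bisect

def prebuild_leaf_boundaries(key_min, key_max, hot_lo, hot_hi,
                             cold_step=1000, hot_step=100):
    """Leaf boundary keys built ahead of time from the known key range:
    coarse leaves over the cold range, fine-grained leaves over the hot one."""
    bounds = set(range(key_min, hot_lo, cold_step))
    bounds |= set(range(hot_lo, hot_hi, hot_step))
    bounds |= set(range(hot_hi, key_max + 1, cold_step))
    bounds.add(key_max + 1)               # exclusive upper sentinel
    return sorted(bounds)

class PrebuiltIndex:
    """Flat stand-in for a pre-built tree: leaves are pre-allocated buckets."""
    def __init__(self, boundaries, leaf_capacity=64):
        self.bounds = boundaries
        self.leaves = [[] for _ in range(len(boundaries) - 1)]
        self.capacity = leaf_capacity
        self.splits = 0                    # proxy for extra (NVM) index writes

    def _leaf(self, key):
        return bisect.bisect_right(self.bounds, key) - 1

    def insert(self, key, value):
        leaf = self.leaves[self._leaf(key)]
        bisect.insort(leaf, (key, value))
        if len(leaf) > self.capacity:      # would force a split in a real tree
            self.splits += 1

    def get(self, key):
        leaf = self.leaves[self._leaf(key)]
        j = bisect.bisect_left(leaf, (key,))
        if j < len(leaf) and leaf[j][0] == key:
            return leaf[j][1]
        return None

if __name__ == "__main__":
    bounds = prebuild_leaf_boundaries(0, 100_000, hot_lo=40_000, hot_hi=50_000)
    index = PrebuiltIndex(bounds)
    for k in range(40_000, 45_000, 3):     # skewed inserts into the hot region
        index.insert(k, k)
    print(index.get(40_003), "splits:", index.splits)   # -> 40003 splits: 0
```

Because the hot range was pre-partitioned into small leaves, the skewed insert burst fills many leaves a little rather than one leaf a lot, so no split (and no corresponding extra index write) is triggered.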
Citations: 0
Co-Approximator: Enabling Performance Prediction in Colocated Applications
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-25 | DOI: 10.1145/3677180
Rafiuzzaman Mohammad, S. Gopalakrishnan, Karthik Pattabiraman
Today's Internet of Things (IoT) devices can colocate multiple applications on a platform with hardware resource sharing. Such colocation increases the throughput of contemporary IoT applications, similar to the use of multi-tenancy in clouds. However, avoiding performance interference among colocated applications through virtualized performance isolation is expensive on IoT platforms due to resource limitations. Hence, on the one hand, colocated IoT applications without performance isolation contend for shared, limited resources, which makes their performance variance discontinuous and a priori unknown. On the other hand, different combinations of colocated applications make the overall state space exceedingly large. All of this makes colocated routines challenging to predict, which makes it difficult to plan which applications to colocate on which platform. We propose Co-Approximator, a technique for systematically sampling an exponentially large colocated-application state space and efficiently approximating it from only four available complete colocation samples. We demonstrate the performance of Co-Approximator with seventeen standard benchmarks and three pipelined data-processing applications on different IoT platforms, where, on average, Co-Approximator reduces existing techniques' approximation error from 61% to just 7%.
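To make the "exceedingly large" state space concrete: every group of two or more applications that could share a platform is a distinct colocation that would otherwise have to be measured. The short sketch below only counts those combinations; it does not reproduce Co-Approximator's approximation method.

```python
from math import comb

def colocation_count(n_apps):
    """Number of distinct colocations: every subset of two or more apps."""
    return sum(comb(n_apps, k) for k in range(2, n_apps + 1))   # = 2^n - n - 1

if __name__ == "__main__":
    for n in (5, 10, 17):   # 17 matches the number of benchmarks in the abstract
        print(f"{n:>2} applications -> {colocation_count(n):>7} possible colocations")
```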
Citations: 0
Trust Based Active Game Data Collection Scheme in Smart Cities
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-22 | DOI: 10.1145/3677319
Zhuoqun Xia, Ziyu Wang, Xiao Liu
The concept of a smart city is to equip various objects in urban life with sensors that monitor areas and collect sensing data, and to make wise decisions based on the collected data. However, malicious sensor devices may interrupt and interfere with data collection, reducing the integrity and availability of information and thereby harming Internet of Things (IoT) applications. Therefore, identifying the credibility of sensor nodes to ensure credible data collection is a challenge. This paper proposes a trust-based active game data collection (TAGDC) scheme to collect trusted data in the IoT. The TAGDC scheme mainly includes the following parts: 1) an active trust framework combined with evolutionary game theory is proposed to encourage high-energy sensors to send detection routes and to quickly obtain sensor trust; 2) to balance the data-security requirements of subnetworks, the number and frequency of detection routes required by subnetworks are estimated through mechanism modeling and the fuzzy analytic hierarchy process; 3) the design focuses on an intra-region trust computing model to evaluate the trust of nodes. The experimental findings demonstrate that the TAGDC scheme described in this study improves the accuracy of identifying malicious nodes by 20%, reduces the required identification time by 40%, and improves the data collection success rate by 5%.
Citations: 0
PredATW: Predicting the Asynchronous Time Warp Latency For VR Systems
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-19 | DOI: 10.1145/3677329
Akanksha Dixit, S. Sarangi
With the advent of low-power, ultra-fast hardware and GPUs, virtual reality (VR) has gained a lot of prominence in the last few years and is being used in areas such as education, entertainment, scientific visualization, and computer-aided design. VR-based applications are highly interactive, and one of their most important performance metrics is the motion-to-photon delay (MPD): the delay from the user's head movement to the time at which the image is updated on the VR screen. Since the human visual system can detect an error of even a few pixels (it is highly spatially sensitive), the MPD should be as small as possible. Popular VR vendors use the GPU-accelerated Asynchronous Time Warp (ATW) algorithm to reduce the MPD. ATW reduces the MPD if and only if the warping operation finishes just before the display refreshes. However, due to the competition among the constituent applications for the single, shared GPU, the GPU-accelerated ATW algorithm suffers from an unpredictable ATW latency, making it challenging to find the ideal time to start the time warp and ensure that it completes with the least amount of lag relative to the screen refresh. Hence, the state of the art is to use a separate hardware unit for the time-warping operation. Our approach, PredATW, uses an ML-based hardware predictor to predict the ATW latency of a VR application and then schedules the warp as late as possible while running the time-warping operation on the GPU itself; this is the first work to do so. Our predictor achieves an error of only 0.22 ms across several popular VR applications when predicting the ATW latency. Compared to the baseline architecture, we reduce deadline misses by 80.6%.
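The scheduling idea, launching the warp as late as the predicted latency allows while still finishing before the next refresh, can be sketched as follows. The fixed safety margin and the moving-average stand-in for the predictor are assumptions made here for illustration; the paper uses an ML-based hardware predictor rather than this heuristic.

```python
def latest_warp_start(next_vsync_ms, predicted_atw_ms, margin_ms=0.25):
    """Latest start time that still finishes the warp before the refresh,
    padded by a safety margin comparable to the predictor's error."""
    return next_vsync_ms - (predicted_atw_ms + margin_ms)

class EmaLatencyPredictor:
    """Cheap software stand-in for the hardware predictor: an exponential
    moving average over recently observed ATW latencies."""
    def __init__(self, alpha=0.2, initial_ms=2.0):
        self.alpha, self.estimate = alpha, initial_ms

    def observe(self, measured_ms):
        self.estimate += self.alpha * (measured_ms - self.estimate)

    def predict(self):
        return self.estimate

if __name__ == "__main__":
    predictor = EmaLatencyPredictor()
    for sample in (1.8, 2.4, 2.1, 2.6):      # measured warp latencies (ms)
        predictor.observe(sample)
    next_vsync = 11.11                       # next refresh on a 90 Hz panel (ms)
    start = latest_warp_start(next_vsync, predictor.predict())
    print(f"predicted ATW latency: {predictor.predict():.2f} ms")
    print(f"start warp at t = {start:.2f} ms")
```

The closer the prediction is to the true latency, the smaller the margin can be, which is why a 0.22 ms prediction error translates directly into fewer missed refresh deadlines.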
Citations: 0
Lightweight Champions of the World: Side-Channel Resistant Open Hardware for Finalists in the NIST Lightweight Cryptography Standardization Process
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-17 | DOI: 10.1145/3677320
Kamyar Mohajerani, Luke Beckwith, Abubakr Abdulgadir, J. Kaps, K. Gaj
Cryptographic competitions have played a significant role in stimulating the development and release of open hardware for cryptography. The primary reason was the focus of standardization organizations and other contest organizers on the transparency and fairness of hardware benchmarking, which could be achieved only with all source code made available for public scrutiny. Consequently, the number and quality of open-source hardware implementations developed during subsequent major competitions, such as AES, SHA-3, and CAESAR, have steadily increased. However, most of these implementations were still quite far from being usable in future products due to the lack of countermeasures against side-channel analysis (SCA). In this paper, we discuss the first coordinated effort at developing SCA-resistant open hardware for all finalists of a cryptographic standardization process. The developed hardware is then evaluated by independent labs for information leakage and resilience to selected attacks. Our target included the ten finalists of the NIST Lightweight Cryptography Standardization Process. The authors' contributions included formulating detailed requirements, publicizing the submissions, matching open hardware with suitable SCA-evaluation labs, developing a subset of all implementations, serving as one of the six evaluation labs, performing FPGA benchmarking of all protected and unprotected implementations, and summarizing results in a comprehensive report. Our results confirm that NIST made the right decision in selecting Ascon as a future lightweight cryptography standard. They also indicate that at least three other algorithms, Xoodyak, TinyJAMBU, and ISAP, were very strong competitors and outperformed Ascon in at least one of the evaluated performance metrics.
Citations: 0
Performance and Communication Cost of Hardware Accelerators for Hashing in Post-Quantum Cryptography
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-09 | DOI: 10.1145/3676965
Patrick Karl, Jonas Schupp, Georg Sigl
SPHINCS+ is a signature scheme included in the first NIST post-quantum standards that bases its security on the underlying hash primitive. As most of the runtime of SPHINCS+ is spent evaluating several hash and pseudo-random functions, offloading this computation to dedicated hardware accelerators is a natural step. In this work, we evaluate different architectures for hardware acceleration of such a hash primitive with respect to its use case and evaluate them in the context of SPHINCS+. We attach hardware accelerators for different hash primitives (SHAKE128 and Ascon-Xof, in both full and round-reduced versions) to CPU interfaces with different transfer speeds. We show that, for most use cases, data transfer determines the overall performance if the accelerators are equipped with FIFOs, and that reducing the number of rounds in the permutation does not necessarily lead to significant performance improvements when using hardware acceleration. This work extends a conference paper accepted at COSADE'24, first published in [19] and written by the same authors, in which different architectures for hardware accelerators of hash functions are benchmarked and evaluated using SPHINCS+ as a case study. In this paper, we provide results for additional SPHINCS+ parameter sets and improve the performance of one of the accelerators by adding an additional RISC-V instruction for faster absorption. We then extend the performance benchmark to include the algorithms CRYSTALS-Kyber, CRYSTALS-Dilithium, and Falcon. Finally, we provide a power/energy comparison for the accelerators.
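The observation that data transfer, rather than hashing, dominates once accelerators are fed through FIFOs can be illustrated with a back-of-the-envelope two-stage pipeline model. The cycle counts below are placeholders, not measurements from the paper.

```python
def total_cycles(n_blocks, transfer_cycles, compute_cycles):
    """Two-stage pipeline with a FIFO between the bus and the hash core:
    after the pipe is filled once, the slower stage sets the pace."""
    bottleneck = max(transfer_cycles, compute_cycles)
    return transfer_cycles + compute_cycles + (n_blocks - 1) * bottleneck

if __name__ == "__main__":
    n_blocks = 1000                      # hash blocks to absorb
    bus, full_perm = 80, 24              # placeholder cycles per block
    reduced_perm = full_perm // 2        # "round-reduced" permutation

    full = total_cycles(n_blocks, bus, full_perm)
    reduced = total_cycles(n_blocks, bus, reduced_perm)
    print("full permutation:  ", full, "cycles")
    print("round-reduced:     ", reduced, "cycles")
    print("speedup:           ", round(full / reduced, 4))
```

With these placeholder numbers, halving the permutation's compute time changes the total by well under one percent, mirroring the abstract's point that round-reduced permutations bring little benefit when the bus is the bottleneck.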
Citations: 0