Lu Li, Qi Tian, Guofeng Qin, Shuaiyu Chen, Weijia Wang
{"title":"Dilithium 紧凑型指令集扩展","authors":"Lu Li, Qi Tian, Guofeng Qin, Shuaiyu Chen, Weijia Wang","doi":"10.1145/3643826","DOIUrl":null,"url":null,"abstract":"<p>Post-quantum cryptography is considered to provide security against both traditional and quantum computer attacks. Dilithium is a digital signature algorithm that derives its security from the challenge of finding short vectors in lattices. It has been selected as one of the standardizations in the NIST post-quantum cryptography project. Hardware-software co-design is a commonly adopted implementation strategy to address various implementation challenges, including limited resources, high performance, and flexibility requirements. In this study, we investigate using compact instruction set extensions (ISEs) for Dilithium, aiming to improve software efficiency with low hardware overheads. To begin with, we propose tightly coupled accelerators that are deeply integrated into the RISC-V processor. These accelerators target the most computationally demanding components in resource-constrained processors, such as polynomial generation, Number Theoretic Transform (NTT), and modular arithmetic. Next, we design a set of custom instructions that seamlessly integrate with the RISC-V base instruction formats, completing the accelerators in a compact manner. Subsequently, we implement our ISEs in a chip design for the Hummingbird E203 core and conduct performance benchmarks for Dilithium utilizing these ISEs. Additionally, we evaluate the resource consumption of the ISEs on FPGA and ASIC technologies. Compared to the reference software implementation on the RISC-V core, our co-design demonstrates a remarkable speedup factor ranging from 6.95 to 9.96. This significant improvement in performance is achieved by incorporating additional hardware resources, specifically, a \\(35\\% \\) increase in LUTs, a \\(14\\% \\) increase in FFs, 7 additional DSPs, and no additional RAM. Furthermore, compared to the state-of-the-art approach, our work achieves faster speed performance with a reduced circuit cost. Specifically, the usage of additional LUTs, FFs, and RAMs is reduced by \\(47.53\\% \\), \\(50.43\\% \\), and \\(100\\% \\), respectively. On ASIC technology, our approach demonstrates 12 412 cell counts. Our co-design provides a better trade-off implementation on speed performance and circuit overheads.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"1 1","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2024-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Compact Instruction Set Extensions for Dilithium\",\"authors\":\"Lu Li, Qi Tian, Guofeng Qin, Shuaiyu Chen, Weijia Wang\",\"doi\":\"10.1145/3643826\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Post-quantum cryptography is considered to provide security against both traditional and quantum computer attacks. Dilithium is a digital signature algorithm that derives its security from the challenge of finding short vectors in lattices. It has been selected as one of the standardizations in the NIST post-quantum cryptography project. Hardware-software co-design is a commonly adopted implementation strategy to address various implementation challenges, including limited resources, high performance, and flexibility requirements. In this study, we investigate using compact instruction set extensions (ISEs) for Dilithium, aiming to improve software efficiency with low hardware overheads. To begin with, we propose tightly coupled accelerators that are deeply integrated into the RISC-V processor. These accelerators target the most computationally demanding components in resource-constrained processors, such as polynomial generation, Number Theoretic Transform (NTT), and modular arithmetic. Next, we design a set of custom instructions that seamlessly integrate with the RISC-V base instruction formats, completing the accelerators in a compact manner. Subsequently, we implement our ISEs in a chip design for the Hummingbird E203 core and conduct performance benchmarks for Dilithium utilizing these ISEs. Additionally, we evaluate the resource consumption of the ISEs on FPGA and ASIC technologies. Compared to the reference software implementation on the RISC-V core, our co-design demonstrates a remarkable speedup factor ranging from 6.95 to 9.96. This significant improvement in performance is achieved by incorporating additional hardware resources, specifically, a \\\\(35\\\\% \\\\) increase in LUTs, a \\\\(14\\\\% \\\\) increase in FFs, 7 additional DSPs, and no additional RAM. Furthermore, compared to the state-of-the-art approach, our work achieves faster speed performance with a reduced circuit cost. Specifically, the usage of additional LUTs, FFs, and RAMs is reduced by \\\\(47.53\\\\% \\\\), \\\\(50.43\\\\% \\\\), and \\\\(100\\\\% \\\\), respectively. On ASIC technology, our approach demonstrates 12 412 cell counts. Our co-design provides a better trade-off implementation on speed performance and circuit overheads.</p>\",\"PeriodicalId\":50914,\"journal\":{\"name\":\"ACM Transactions on Embedded Computing Systems\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-02-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Embedded Computing Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3643826\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Embedded Computing Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3643826","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Post-quantum cryptography is considered to provide security against both traditional and quantum computer attacks. Dilithium is a digital signature algorithm that derives its security from the challenge of finding short vectors in lattices. It has been selected as one of the standardizations in the NIST post-quantum cryptography project. Hardware-software co-design is a commonly adopted implementation strategy to address various implementation challenges, including limited resources, high performance, and flexibility requirements. In this study, we investigate using compact instruction set extensions (ISEs) for Dilithium, aiming to improve software efficiency with low hardware overheads. To begin with, we propose tightly coupled accelerators that are deeply integrated into the RISC-V processor. These accelerators target the most computationally demanding components in resource-constrained processors, such as polynomial generation, Number Theoretic Transform (NTT), and modular arithmetic. Next, we design a set of custom instructions that seamlessly integrate with the RISC-V base instruction formats, completing the accelerators in a compact manner. Subsequently, we implement our ISEs in a chip design for the Hummingbird E203 core and conduct performance benchmarks for Dilithium utilizing these ISEs. Additionally, we evaluate the resource consumption of the ISEs on FPGA and ASIC technologies. Compared to the reference software implementation on the RISC-V core, our co-design demonstrates a remarkable speedup factor ranging from 6.95 to 9.96. This significant improvement in performance is achieved by incorporating additional hardware resources, specifically, a \(35\% \) increase in LUTs, a \(14\% \) increase in FFs, 7 additional DSPs, and no additional RAM. Furthermore, compared to the state-of-the-art approach, our work achieves faster speed performance with a reduced circuit cost. Specifically, the usage of additional LUTs, FFs, and RAMs is reduced by \(47.53\% \), \(50.43\% \), and \(100\% \), respectively. On ASIC technology, our approach demonstrates 12 412 cell counts. Our co-design provides a better trade-off implementation on speed performance and circuit overheads.
期刊介绍:
The design of embedded computing systems, both the software and hardware, increasingly relies on sophisticated algorithms, analytical models, and methodologies. ACM Transactions on Embedded Computing Systems (TECS) aims to present the leading work relating to the analysis, design, behavior, and experience with embedded computing systems.