Pub Date : 2024-03-12DOI: 10.46586/tches.v2024.i2.359-383
Olivier Bronchain, M. Azouaoui, Mohamed ElGhamrawy, Joost Renes, Tobias Schneider
We present a set of physical profiled attacks against CRYSTALS-Dilithium that accumulate noisy knowledge on secret keys over multiple signatures, finally leading to a full key recovery attack. The methodology is composed of two steps. The first step consists of observing or inserting a bias in the posterior distribution of sensitive variables. The second step is an information processing phase which is based on belief propagation and effectively exploits that bias. The proposed concrete attacks rely on side-channel information, induced faults or possibly a combination of the two. Interestingly, the adversary benefits most from this previous knowledge when targeting the released signatures, however, the latter are not strictly necessary. We show that the combination of a physical attack with the binary knowledge of acceptance or rejection of a signature also leads to exploitable information on the secret key. Finally, we demonstrate that this approach is also effective against shuffled implementations of CRYSTALS-Dilithium.
{"title":"Exploiting Small-Norm Polynomial Multiplication with Physical Attacks: Application to CRYSTALS-Dilithium","authors":"Olivier Bronchain, M. Azouaoui, Mohamed ElGhamrawy, Joost Renes, Tobias Schneider","doi":"10.46586/tches.v2024.i2.359-383","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.359-383","url":null,"abstract":"We present a set of physical profiled attacks against CRYSTALS-Dilithium that accumulate noisy knowledge on secret keys over multiple signatures, finally leading to a full key recovery attack. The methodology is composed of two steps. The first step consists of observing or inserting a bias in the posterior distribution of sensitive variables. The second step is an information processing phase which is based on belief propagation and effectively exploits that bias. The proposed concrete attacks rely on side-channel information, induced faults or possibly a combination of the two. Interestingly, the adversary benefits most from this previous knowledge when targeting the released signatures, however, the latter are not strictly necessary. We show that the combination of a physical attack with the binary knowledge of acceptance or rejection of a signature also leads to exploitable information on the secret key. Finally, we demonstrate that this approach is also effective against shuffled implementations of CRYSTALS-Dilithium.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"47 6","pages":"1545"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140249766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-12DOI: 10.46586/tches.v2024.i2.588-629
Darius Mercadier, Viet Sang Nguyen, Matthieu Rivain, A. Udovenko
Software obfuscation is a powerful tool to protect the intellectual property or secret keys inside programs. Strong software obfuscation is crucial in the context of untrusted execution environments (e.g., subject to malware infection) or to face potentially malicious users trying to reverse-engineer a sensitive program. Unfortunately, the state-of-the-art of pure software-based obfuscation (including white-box cryptography) is either insecure or infeasible in practice.This work introduces OBSCURE, a versatile framework for practical and cryptographically strong software obfuscation relying on a simple stateless secure element (to be embedded, for example, in a protected hardware chip or a token). Based on the foundational result by Goyal et al. from TCC 2010, our scheme enjoys provable security guarantees, and further focuses on practical aspects, such as efficient execution of the obfuscated programs, while maintaining simplicity of the secure element. In particular, we propose a new rectangular universalization technique, which is also of independent interest. We provide an implementation of OBSCURE taking as input a program source code written in a subset of the C programming language. This ensures usability and a broad range of applications of our framework. We benchmark the obfuscation on simple software programs as well as on cryptographic primitives, hence highlighting the possible use cases of the framework as an alternative to pure software-based white-box implementations.
{"title":"OBSCURE: Versatile Software Obfuscation from a Lightweight Secure Element","authors":"Darius Mercadier, Viet Sang Nguyen, Matthieu Rivain, A. Udovenko","doi":"10.46586/tches.v2024.i2.588-629","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.588-629","url":null,"abstract":"Software obfuscation is a powerful tool to protect the intellectual property or secret keys inside programs. Strong software obfuscation is crucial in the context of untrusted execution environments (e.g., subject to malware infection) or to face potentially malicious users trying to reverse-engineer a sensitive program. Unfortunately, the state-of-the-art of pure software-based obfuscation (including white-box cryptography) is either insecure or infeasible in practice.This work introduces OBSCURE, a versatile framework for practical and cryptographically strong software obfuscation relying on a simple stateless secure element (to be embedded, for example, in a protected hardware chip or a token). Based on the foundational result by Goyal et al. from TCC 2010, our scheme enjoys provable security guarantees, and further focuses on practical aspects, such as efficient execution of the obfuscated programs, while maintaining simplicity of the secure element. In particular, we propose a new rectangular universalization technique, which is also of independent interest. We provide an implementation of OBSCURE taking as input a program source code written in a subset of the C programming language. This ensures usability and a broad range of applications of our framework. We benchmark the obfuscation on simple software programs as well as on cryptographic primitives, hence highlighting the possible use cases of the framework as an alternative to pure software-based white-box implementations.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"272 1","pages":"77"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140250022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-12DOI: 10.46586/tches.v2024.i2.329-358
Hao Cheng, D. Page
Even given a state-of-the-art masking scheme, masked software implementation of some cryptography functionality can pose significant challenges stemming, e.g., from simultaneous requirements for efficiency and security. In this paper we design an Instruction Set Extension (ISE) to address a specific element of said challenge, namely the elimination of leakage stemming from architectural and microarchitectural overwriting. Conceptually, the ISE allows a leakage-focused behavioural hint to be communicated from software to the micro-architecture: using it informs how computation is realised when applied to masking-specific data, which then offers an opportunity to eliminate associated leakage. We develop prototype, latencyand area-optimised implementations of the ISE design based on the RISC-V Ibex core. Using them, we demonstrate that use of the ISE can close the gap between assumptions about and actual behaviour of a device and thereby deliver an improved security guarantee.
即使采用最先进的掩码方案,某些加密功能的掩码软件实现也会面临巨大挑战,例如,同时满足效率和安全性要求。在本文中,我们设计了一种指令集扩展(ISE)来应对上述挑战中的一个特定因素,即消除因架构和微架构覆盖而产生的泄漏。从概念上讲,ISE 允许从软件向微体系结构传递以泄密为重点的行为提示:使用它可以了解在应用于特定掩码数据时如何实现计算,从而提供消除相关泄密的机会。我们开发了基于 RISC-V Ibex 内核的 ISE 设计原型、延迟和区域优化实现。通过使用它们,我们证明了 ISE 的使用可以缩小设备假设与实际行为之间的差距,从而提供更好的安全保证。
{"title":"eLIMInate: a Leakage-focused ISE for Masked Implementation","authors":"Hao Cheng, D. Page","doi":"10.46586/tches.v2024.i2.329-358","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.329-358","url":null,"abstract":"Even given a state-of-the-art masking scheme, masked software implementation of some cryptography functionality can pose significant challenges stemming, e.g., from simultaneous requirements for efficiency and security. In this paper we design an Instruction Set Extension (ISE) to address a specific element of said challenge, namely the elimination of leakage stemming from architectural and microarchitectural overwriting. Conceptually, the ISE allows a leakage-focused behavioural hint to be communicated from software to the micro-architecture: using it informs how computation is realised when applied to masking-specific data, which then offers an opportunity to eliminate associated leakage. We develop prototype, latencyand area-optimised implementations of the ISE design based on the RISC-V Ibex core. Using them, we demonstrate that use of the ISE can close the gap between assumptions about and actual behaviour of a device and thereby deliver an improved security guarantee.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"57 3","pages":"966"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140248471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MAYO is a popular high-calorie condiment as well as an auspicious candidate in the ongoing NIST competition for additional post-quantum signature schemes achieving competitive signature and public key sizes. In this work, we present high-speed implementations of MAYO using the AVX2 and Armv7E-M instruction sets targeting recent x86 platforms and the Arm Cortex-M4. Moreover, the main contribution of our work is showing that MAYO can be even faster when switching from a bitsliced representation of keys to a nibble-sliced representation. While the bitsliced representation was primarily motivated by faster arithmetic on microcontrollers, we show that it is not necessary for achieving high performance on Cortex-M4. On Cortex-M4, we instead propose to implement the large matrix multiplications of MAYO using the Method of the Four Russians (M4R), which allows us to achieve better performance than when using the bitsliced approach. This results in up to 21% faster signing. For AVX2, the change in representation allows us to implement the arithmetic much faster using shuffle instructions. Signing takes up to 3.2x fewer cycles and key generation and verification enjoy similar speedups. This shows that MAYO is competitive with lattice-based signature schemes on x86 CPUs, and a factor of 2-6 slower than lattice-based signature schemes on Cortex-M4 (which can still be considered competitive).
MAYO 是一种广受欢迎的高热量调味品,也是美国国家标准与技术研究院(NIST)正在进行的后量子签名方案竞赛中的吉祥候选方案,该方案可实现具有竞争力的签名和公钥大小。在这项工作中,我们介绍了使用 AVX2 和 Armv7E-M 指令集针对最新 x86 平台和 Arm Cortex-M4 的 MAYO 高速实现。此外,我们工作的主要贡献在于证明了 MAYO 在从密钥的比特切片表示法转换为尼布尔切片表示法时速度更快。虽然比特切片表示法的主要动机是提高微控制器的运算速度,但我们的研究表明,在 Cortex-M4 上实现高性能并不需要比特切片表示法。相反,我们建议在 Cortex-M4 上使用四俄方法 (M4R) 实现 MAYO 的大型矩阵乘法,这样就能获得比使用位切分方法更好的性能。这使得签名速度提高了 21%。对于 AVX2,表示方法的改变使我们能够使用洗牌指令更快地实现算术运算。签名所需的周期减少了 3.2 倍,密钥生成和验证的速度也得到了类似的提升。这表明 MAYO 在 x86 CPU 上与基于网格的签名方案相比具有竞争力,在 Cortex-M4 上比基于网格的签名方案慢 2-6 倍(仍可认为具有竞争力)。
{"title":"Nibbling MAYO: Optimized Implementations for AVX2 and Cortex-M4","authors":"Ward Beullens, Fabio Campos, Sof´ıa Celi, Basil Hess, Matthias J. Kannwischer","doi":"10.46586/tches.v2024.i2.252-275","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.252-275","url":null,"abstract":"MAYO is a popular high-calorie condiment as well as an auspicious candidate in the ongoing NIST competition for additional post-quantum signature schemes achieving competitive signature and public key sizes. In this work, we present high-speed implementations of MAYO using the AVX2 and Armv7E-M instruction sets targeting recent x86 platforms and the Arm Cortex-M4. Moreover, the main contribution of our work is showing that MAYO can be even faster when switching from a bitsliced representation of keys to a nibble-sliced representation. While the bitsliced representation was primarily motivated by faster arithmetic on microcontrollers, we show that it is not necessary for achieving high performance on Cortex-M4. On Cortex-M4, we instead propose to implement the large matrix multiplications of MAYO using the Method of the Four Russians (M4R), which allows us to achieve better performance than when using the bitsliced approach. This results in up to 21% faster signing. For AVX2, the change in representation allows us to implement the arithmetic much faster using shuffle instructions. Signing takes up to 3.2x fewer cycles and key generation and verification enjoy similar speedups. This shows that MAYO is competitive with lattice-based signature schemes on x86 CPUs, and a factor of 2-6 slower than lattice-based signature schemes on Cortex-M4 (which can still be considered competitive).","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"27 2","pages":"1683"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140250018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-12DOI: 10.46586/tches.v2024.i2.215-251
S. Deshpande, James Howe, Jakub Szefer, Dongze Yue
This work presents the first hardware realisation of the Syndrome-Decodingin-the-Head (SDitH) signature scheme, which is a candidate in the NIST PQC process for standardising post-quantum secure digital signature schemes. SDitH’s hardness is based on conservative code-based assumptions, and it uses the Multi-Party-Computation-in-the-Head (MPCitH) construction. This is the first hardware design of a code-based signature scheme based on traditional decoding problems and only the second for MPCitH constructions, after Picnic. This work presents optimised designs to achieve the best area efficiency, which we evaluate using the Time-Area Product (TAP) metric. This work also proposes a novel hardware architecture by dividing the signature generation algorithm into two phases, namely offline and online phases for optimising the overall clock cycle count. The hardware designs for key generation, signature generation, and signature verification are parameterised for all SDitH parameters, including the NIST security levels, both syndrome decoding base fields (GF256 and GF251), and thus conforms to the SDitH specifications. The hardware design further supports secret share splitting, and the hypercube optimisation which can be applied in this and multiple other NIST PQC candidates. The results of this work result in a hardware design with a drastic reducing in clock cycles compared to the optimised AVX2 software implementation, in the range of 2-4x for most operations. Our key generation outperforms software drastically, giving a 11-17x reduction in runtime, despite the significantly faster clock speed. On Artix 7 FPGAs we can perform key generation in 55.1 Kcycles, signature generation in 6.7 Mcycles, and signature verification in 8.6 Mcycles for NIST L1 parameters, which increase for GF251, and for L3 and L5 parameters.
{"title":"SDitH in Hardware","authors":"S. Deshpande, James Howe, Jakub Szefer, Dongze Yue","doi":"10.46586/tches.v2024.i2.215-251","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.215-251","url":null,"abstract":"This work presents the first hardware realisation of the Syndrome-Decodingin-the-Head (SDitH) signature scheme, which is a candidate in the NIST PQC process for standardising post-quantum secure digital signature schemes. SDitH’s hardness is based on conservative code-based assumptions, and it uses the Multi-Party-Computation-in-the-Head (MPCitH) construction. This is the first hardware design of a code-based signature scheme based on traditional decoding problems and only the second for MPCitH constructions, after Picnic. This work presents optimised designs to achieve the best area efficiency, which we evaluate using the Time-Area Product (TAP) metric. This work also proposes a novel hardware architecture by dividing the signature generation algorithm into two phases, namely offline and online phases for optimising the overall clock cycle count. The hardware designs for key generation, signature generation, and signature verification are parameterised for all SDitH parameters, including the NIST security levels, both syndrome decoding base fields (GF256 and GF251), and thus conforms to the SDitH specifications. The hardware design further supports secret share splitting, and the hypercube optimisation which can be applied in this and multiple other NIST PQC candidates. The results of this work result in a hardware design with a drastic reducing in clock cycles compared to the optimised AVX2 software implementation, in the range of 2-4x for most operations. Our key generation outperforms software drastically, giving a 11-17x reduction in runtime, despite the significantly faster clock speed. On Artix 7 FPGAs we can perform key generation in 55.1 Kcycles, signature generation in 6.7 Mcycles, and signature verification in 8.6 Mcycles for NIST L1 parameters, which increase for GF251, and for L3 and L5 parameters.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"98 4","pages":"69"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140249566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-12DOI: 10.46586/tches.v2024.i2.630-653
Jiangxue Liu, Cankun Zhao, Shuohang Peng, Bohan Yang, Hang Zhao, Xiangdong Han, Min Zhu, Shaojun Wei, Leibo Liu
Masking, an effective countermeasure against side-channel attacks, is commonly applied in modern cryptographic implementations. Considering cryptographic algorithms that utilize both Boolean and arithmetic masking, the conversion algorithm between arithmetic masking and Boolean masking is required. Conventional high-order arithmetic masking to Boolean masking conversion algorithms based on Boolean circuits suffer from performance overhead, especially in terms of hardware implementation. In this work, we analyze high latency for the conversion and propose an improved high-order A2B conversion algorithm. For the conversion of 16-bit variables, the hardware latency can be reduced by 47% in the best scenario. For the case study of second-order 32-bit conversion, the implementation results show that the improved scheme reduces the clock cycle latency by 42% in hardware and achieves a 30% speed performance improvement in software. Theoretically, a security proof of arbitrary order is provided for the proposed high-order A2B conversion. Experimental validations are performed to verify the second-order DPA resistance of second-order implementation. The Test Vector Leakage Assessment does not observe side-channel leakage for hardware and software implementations.
{"title":"A Low-Latency High-Order Arithmetic to Boolean Masking Conversion","authors":"Jiangxue Liu, Cankun Zhao, Shuohang Peng, Bohan Yang, Hang Zhao, Xiangdong Han, Min Zhu, Shaojun Wei, Leibo Liu","doi":"10.46586/tches.v2024.i2.630-653","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.630-653","url":null,"abstract":"Masking, an effective countermeasure against side-channel attacks, is commonly applied in modern cryptographic implementations. Considering cryptographic algorithms that utilize both Boolean and arithmetic masking, the conversion algorithm between arithmetic masking and Boolean masking is required. Conventional high-order arithmetic masking to Boolean masking conversion algorithms based on Boolean circuits suffer from performance overhead, especially in terms of hardware implementation. In this work, we analyze high latency for the conversion and propose an improved high-order A2B conversion algorithm. For the conversion of 16-bit variables, the hardware latency can be reduced by 47% in the best scenario. For the case study of second-order 32-bit conversion, the implementation results show that the improved scheme reduces the clock cycle latency by 42% in hardware and achieves a 30% speed performance improvement in software. Theoretically, a security proof of arbitrary order is provided for the proposed high-order A2B conversion. Experimental validations are performed to verify the second-order DPA resistance of second-order implementation. The Test Vector Leakage Assessment does not observe side-channel leakage for hardware and software implementations.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"51 4","pages":"45"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140249878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-12DOI: 10.46586/tches.v2024.i2.25-63
Tian Zhou, Fangyu Zheng, Guang Fan, Lipeng Wan, Wenxu Tang, Yixuan Song, Yi Bian, Jingqiang Lin
The remarkable performance capabilities of AI accelerators offer promising opportunities for accelerating cryptographic algorithms, particularly in the context of lattice-based cryptography. However, current approaches to leveraging AI accelerators often remain at a rudimentary level of implementation, overlooking the intricate internal mechanisms of these devices. Consequently, a significant number of computational resources is underutilized.In this paper, we present a comprehensive exploration of NVIDIA Tensor Cores and introduce a novel framework tailored specifically for Kyber. Firstly, we propose two innovative approaches that efficiently break down Kyber’s NTT into iterative matrix multiplications, resulting in approximately a 75% reduction in costs compared to the state-of-the-art scanning-based methods. Secondly, by reversing the internal mechanisms, we precisely manipulate the internal resources of Tensor Cores using assembly-level code instead of inefficient standard interfaces, eliminating memory accesses and redundant function calls. Finally, building upon our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber. Our implementation surpasses the state-of-the-art Tensor Core based work, achieving remarkable speed-ups of 1.93x, 1.65x, 1.22x and 3.55x for polyvec_ntt, KeyGen, Enc and Dec in Kyber-1024, respectively. Even when considering execution latency, our throughput-oriented full Kyber implementation maintains an acceptable execution latency. For instance, the execution latency ranges from 1.02 to 5.68 milliseconds for Kyber-1024 on R3080 when achieving the peak throughput.
{"title":"ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches","authors":"Tian Zhou, Fangyu Zheng, Guang Fan, Lipeng Wan, Wenxu Tang, Yixuan Song, Yi Bian, Jingqiang Lin","doi":"10.46586/tches.v2024.i2.25-63","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.25-63","url":null,"abstract":"The remarkable performance capabilities of AI accelerators offer promising opportunities for accelerating cryptographic algorithms, particularly in the context of lattice-based cryptography. However, current approaches to leveraging AI accelerators often remain at a rudimentary level of implementation, overlooking the intricate internal mechanisms of these devices. Consequently, a significant number of computational resources is underutilized.In this paper, we present a comprehensive exploration of NVIDIA Tensor Cores and introduce a novel framework tailored specifically for Kyber. Firstly, we propose two innovative approaches that efficiently break down Kyber’s NTT into iterative matrix multiplications, resulting in approximately a 75% reduction in costs compared to the state-of-the-art scanning-based methods. Secondly, by reversing the internal mechanisms, we precisely manipulate the internal resources of Tensor Cores using assembly-level code instead of inefficient standard interfaces, eliminating memory accesses and redundant function calls. Finally, building upon our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber. Our implementation surpasses the state-of-the-art Tensor Core based work, achieving remarkable speed-ups of 1.93x, 1.65x, 1.22x and 3.55x for polyvec_ntt, KeyGen, Enc and Dec in Kyber-1024, respectively. Even when considering execution latency, our throughput-oriented full Kyber implementation maintains an acceptable execution latency. For instance, the execution latency ranges from 1.02 to 5.68 milliseconds for Kyber-1024 on R3080 when achieving the peak throughput.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"32 6","pages":"95"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140250342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-12DOI: 10.46586/tches.v2024.i2.304-328
Gora Adj, Stefano Barbero, Emanuele Bellini, Andre Esser, Luis Rivera-Zamarripa, Carlo Sanna, Javier A. Verbel, Floyd Zweydinger
Since 2016’s NIST call for standardization of post-quantum cryptographic primitives, developing efficient post-quantum secure digital signature schemes has become a highly active area of research. The difficulty in constructing such schemes is evidenced by NIST reopening the call in 2022 for digital signature schemes, because of missing diversity in existing proposals. In this work, we introduce the new postquantum digital signature scheme MiRitH. As direct successor of a scheme recently developed by Adj, Rivera-Zamarripa and Verbel (Africacrypt ’23), it is based on the hardness of the MinRank problem and follows the MPC-in-the-Head paradigm. We revisit the initial proposal, incorporate design-level improvements and provide more efficient parameter sets. We also provide the missing justification for the quantum security of all parameter sets following NIST metrics. In this context we design a novel Grover-amplified quantum search algorithm for solving the MinRank problem that outperforms a naive quantum brute-force search for the solution.MiRitH obtains signatures of size 5.7 kB for NIST category I security and therefore competes for the smallest signatures among any post-quantum signature following the MPCitH paradigm.At the same time MiRitH offers competitive signing and verification timings compared to the state of the art. To substantiate those claims we provide extensive implementations. This includes a reference implementation as well as optimized constant-time implementations for Intel processors (AVX2), and for the ARM (NEON) architecture. The speedup of our optimized AVX2 implementation relies mostly on a redesign of the finite field arithmetic, improving over existing implementations as well as an improved memory management.
{"title":"MiRitH: Efficient Post-Quantum Signatures from MinRank in the Head","authors":"Gora Adj, Stefano Barbero, Emanuele Bellini, Andre Esser, Luis Rivera-Zamarripa, Carlo Sanna, Javier A. Verbel, Floyd Zweydinger","doi":"10.46586/tches.v2024.i2.304-328","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.304-328","url":null,"abstract":"Since 2016’s NIST call for standardization of post-quantum cryptographic primitives, developing efficient post-quantum secure digital signature schemes has become a highly active area of research. The difficulty in constructing such schemes is evidenced by NIST reopening the call in 2022 for digital signature schemes, because of missing diversity in existing proposals. In this work, we introduce the new postquantum digital signature scheme MiRitH. As direct successor of a scheme recently developed by Adj, Rivera-Zamarripa and Verbel (Africacrypt ’23), it is based on the hardness of the MinRank problem and follows the MPC-in-the-Head paradigm. We revisit the initial proposal, incorporate design-level improvements and provide more efficient parameter sets. We also provide the missing justification for the quantum security of all parameter sets following NIST metrics. In this context we design a novel Grover-amplified quantum search algorithm for solving the MinRank problem that outperforms a naive quantum brute-force search for the solution.MiRitH obtains signatures of size 5.7 kB for NIST category I security and therefore competes for the smallest signatures among any post-quantum signature following the MPCitH paradigm.At the same time MiRitH offers competitive signing and verification timings compared to the state of the art. To substantiate those claims we provide extensive implementations. This includes a reference implementation as well as optimized constant-time implementations for Intel processors (AVX2), and for the ARM (NEON) architecture. The speedup of our optimized AVX2 implementation relies mostly on a redesign of the finite field arithmetic, improving over existing implementations as well as an improved memory management.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"5 3","pages":"1666"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140251025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-12DOI: 10.46586/tches.v2024.i2.481-521
S. Banik, F. Regazzoni
The Möbius transform is a linear circuit used to compute the evaluations of a Boolean function over all points on its input domain. The operation is very useful in finding the solution of a system of polynomial equations over GF(2) for obvious reasons. However the operation, although linear, needs exponential number of logic operations (around n · 2n−1 bit xors) for an n-variable Boolean function. As such, the only known hardware circuit to efficiently compute the Möbius Transform requires silicon area that is exponential in n. For Boolean functions whose algebraic degree is bound by some parameter d, recursive definitions of the Möbius Transform exist that requires only O(nd+1) space in software. However converting the mathematical definition of this space-efficient algorithm into a hardware architecture is a non-trivial task, primarily because the recursion calls notionally lead to a depth-first search in a transition graph that requires context switches at each recursion call for which straightforward mapping to hardware is difficult. In this paper we look to overcome these very challenges in an engineering sense. We propose a space efficient sequential hardware circuit for the Möbius Transform that requires only polynomial circuit area (i.e. O(nd+1)) provided the algebraic degree of the Boolean function is limited to d. We show how this circuit can be used as a component to efficiently solve polynomial equations of degree at most d by using fast exhaustive search. We propose three different circuit architectures for this, each of which uses the Möbius Transform circuit as a core component. We show that asymptotically, all the solutions of a system of m polynomials in n unknowns and algebraic degree d over GF(2) can be found using a circuit of silicon area proportional to m · nd+1 and circuit depth proportional to 2 · log2(n − d).In the second part of the paper we introduce a fourth hardware solver that additionally aims to achieve energy efficiency. The main idea is to reduce the solution space to a small enough value by parallel application of Möbius Transform circuits over the first few equations of the system. This is done so that one can check individually whether the vectors of this reduced solution space satisfy each of the remaining equations of the system using lower power consumption. The new circuit has area also bound by m · nd+1 and has circuit depth proportional to d · log2 n. We also show that further optimizations with respect to energy consumption may be obtained by using depth-bound Möbius circuits that exponentially decrease run time at the cost of additional logic area and depth.
莫比乌斯变换是一种线性电路,用于计算布尔函数对其输入域上所有点的求值。由于显而易见的原因,该运算在寻找 GF(2) 上多项式方程组的解时非常有用。然而,尽管该运算是线性的,但对于一个 n 变量布尔函数来说,却需要指数数量的逻辑运算(约 n - 2n-1 位 xors)。因此,已知唯一能有效计算莫比乌斯变换的硬件电路所需的硅片面积是 n 的指数倍。对于代数阶数受某个参数 d 约束的布尔函数,莫比乌斯变换的递归定义只需要 O(nd+1) 的软件空间。然而,将这种空间高效算法的数学定义转换为硬件架构并非易事,这主要是因为递归调用在概念上导致在过渡图中进行深度优先搜索,而每次递归调用都需要进行上下文切换,这就很难直接映射到硬件上。在本文中,我们希望在工程学意义上克服这些挑战。我们为莫比乌斯变换提出了一种空间高效的顺序硬件电路,只要布尔函数的代数阶数不超过 d,它就只需要多项式电路面积(即 O(nd+1))。为此,我们提出了三种不同的电路架构,每种架构都使用莫比乌斯变换电路作为核心组件。我们的研究表明,从渐近的角度看,使用硅面积与 m - nd+1 成比例、电路深度与 2 - log2(n - d) 成比例的电路,可以求得 GF(2) 上 n 个未知数中 m 个多项式、代数阶数为 d 的系统的所有解。其主要思路是通过并行应用莫比乌斯变换电路对系统的前几个方程进行求解,从而将求解空间缩小到足够小的数值。这样,我们就能以更低的功耗逐个检查缩小后的解空间向量是否满足系统的每个剩余方程。新电路的面积也受 m - nd+1 约束,电路深度与 d - log2 n 成正比。我们还表明,通过使用深度约束莫比乌斯电路,可以以增加逻辑面积和深度为代价,指数级缩短运行时间,从而进一步优化能耗。
{"title":"Compact Circuits for Efficient Mobius Transform","authors":"S. Banik, F. Regazzoni","doi":"10.46586/tches.v2024.i2.481-521","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.481-521","url":null,"abstract":"The Möbius transform is a linear circuit used to compute the evaluations of a Boolean function over all points on its input domain. The operation is very useful in finding the solution of a system of polynomial equations over GF(2) for obvious reasons. However the operation, although linear, needs exponential number of logic operations (around n · 2n−1 bit xors) for an n-variable Boolean function. As such, the only known hardware circuit to efficiently compute the Möbius Transform requires silicon area that is exponential in n. For Boolean functions whose algebraic degree is bound by some parameter d, recursive definitions of the Möbius Transform exist that requires only O(nd+1) space in software. However converting the mathematical definition of this space-efficient algorithm into a hardware architecture is a non-trivial task, primarily because the recursion calls notionally lead to a depth-first search in a transition graph that requires context switches at each recursion call for which straightforward mapping to hardware is difficult. In this paper we look to overcome these very challenges in an engineering sense. We propose a space efficient sequential hardware circuit for the Möbius Transform that requires only polynomial circuit area (i.e. O(nd+1)) provided the algebraic degree of the Boolean function is limited to d. We show how this circuit can be used as a component to efficiently solve polynomial equations of degree at most d by using fast exhaustive search. We propose three different circuit architectures for this, each of which uses the Möbius Transform circuit as a core component. We show that asymptotically, all the solutions of a system of m polynomials in n unknowns and algebraic degree d over GF(2) can be found using a circuit of silicon area proportional to m · nd+1 and circuit depth proportional to 2 · log2(n − d).In the second part of the paper we introduce a fourth hardware solver that additionally aims to achieve energy efficiency. The main idea is to reduce the solution space to a small enough value by parallel application of Möbius Transform circuits over the first few equations of the system. This is done so that one can check individually whether the vectors of this reduced solution space satisfy each of the remaining equations of the system using lower power consumption. The new circuit has area also bound by m · nd+1 and has circuit depth proportional to d · log2 n. We also show that further optimizations with respect to energy consumption may be obtained by using depth-bound Möbius circuits that exponentially decrease run time at the cost of additional logic area and depth.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"20 4","pages":"948"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140248326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-12DOI: 10.46586/tches.v2024.i2.795-818
P. Ravi, Thales Paiva, Dirmanto Jap, Jan-Pieter D'Anvers, S. Bhasin
In an effort to circumvent the high cost of standard countermeasures against side-channel attacks in post-quantum cryptography, some works have developed low-cost detection-based countermeasures. These countermeasures try to detect maliciously generated input ciphertexts and react to them by discarding the ciphertext or secret key. In this work, we take a look at two previously proposed low-cost countermeasures: the ciphertext sanity check and the decapsulation failure check, and demonstrate successful attacks on these schemes. We show that the first countermeasure can be broken with little to no overhead, while the second countermeasure requires a more elaborate attack strategy that relies on valid chosen ciphertexts. Thus, in this work, we propose the first chosen-ciphertext based side-channel attack that only relies on valid ciphertexts for key recovery. As part of this attack, a third contribution of our paper is an improved solver that retrieves the secret key from linear inequalities constructed using side-channel leakage from the decryption procedure. Our solver is an improvement over the state-of-the-art Belief Propagation solvers by Pessl and Prokop, and later Delvaux. Our method is simpler, easier to understand and has lower computational complexity, while needing less than half the inequalities compared to previous methods.
{"title":"Defeating Low-Cost Countermeasures against Side-Channel Attacks in Lattice-based Encryption - A Case Study on Crystals-Kyber","authors":"P. Ravi, Thales Paiva, Dirmanto Jap, Jan-Pieter D'Anvers, S. Bhasin","doi":"10.46586/tches.v2024.i2.795-818","DOIUrl":"https://doi.org/10.46586/tches.v2024.i2.795-818","url":null,"abstract":"In an effort to circumvent the high cost of standard countermeasures against side-channel attacks in post-quantum cryptography, some works have developed low-cost detection-based countermeasures. These countermeasures try to detect maliciously generated input ciphertexts and react to them by discarding the ciphertext or secret key. In this work, we take a look at two previously proposed low-cost countermeasures: the ciphertext sanity check and the decapsulation failure check, and demonstrate successful attacks on these schemes. We show that the first countermeasure can be broken with little to no overhead, while the second countermeasure requires a more elaborate attack strategy that relies on valid chosen ciphertexts. Thus, in this work, we propose the first chosen-ciphertext based side-channel attack that only relies on valid ciphertexts for key recovery. As part of this attack, a third contribution of our paper is an improved solver that retrieves the secret key from linear inequalities constructed using side-channel leakage from the decryption procedure. Our solver is an improvement over the state-of-the-art Belief Propagation solvers by Pessl and Prokop, and later Delvaux. Our method is simpler, easier to understand and has lower computational complexity, while needing less than half the inequalities compared to previous methods.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"99 2","pages":"1627"},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140250555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}