Secure and Efficient Masking of Lightweight Ciphers in Software and Hardware

Computer Journal (IF 1.5; CAS Zone 4, Computer Science; JCR Q4, Computer Science, Hardware & Architecture). Pub Date: 2023-03-01. DOI: 10.1093/comjnl/bxad002

Xuefeng Zhao


Citations: 0

Abstract

Masking is a widely used and deployed countermeasure against side-channel attacks, in both software and hardware. Because masking comes at a great cost, research has focused on lowering the performance penalty and finding efficient masked implementations. Our contribution is twofold. For software masking, we first find bitsliced implementations of the Sboxes with multiplicative complexity 4 and multiplicative depth 2, then adapt the common shares approach introduced by Coron et al. at CHES 2016 so that many cross-products $a_{i}\cdot b_{j}$ can be reused across parallel ISW-based 32-bit nonlinear operations. As a result, we improve the efficiency of 2$\times b/4/32$ parallel high-order ISW masking for RECTANGLE, TANGRAM and KNOT on a 32-bit ARM embedded microprocessor, with a speed-up of roughly 13%-34%, at the cost of $(1+d) \times 32$-bit randomness. For hardware masking, 4-bit cubic Sboxes with quadratic decomposition length 2, including those of RECTANGLE, TANGRAM, KNOT and LWC third-round candidates, can be implemented as 3-share and 4-share threshold implementations (TI) by decomposing the cubic permutation $S$ into a composition of sub-permutations of lower algebraic degree. We use two decomposition forms: the composition of two quadratic permutations $G$ and $F$, $S = F\circ G$, targets efficiency; the composition of linear permutations $A_i$ and a single quadratic permutation $G$, $S=A_3 \circ G \circ A_2 \circ G \circ A_1$, targets reduced area. For $S = F\circ G$, we introduce a new approach that searches through all possible quadratic permutations $G$ with complexity 2$^{25.71}$, which is more efficient than the 2$^{26.23}$ of Poschmann et al. (J. Cryptol. 2011). For $S=A_3 \circ G \circ A_2 \circ G \circ A_1$, our approach finds the $A_i$ with complexity 2$^{27.71}$, which is more efficient than the method introduced by Moradi et al. at ASIACRYPT 2016. In addition, we propose a new decomposition, $S=G \circ A_2 \circ G \circ A_1$. With these approaches, we can find the fastest and the smallest hardware decomposition implementations of 4-bit permutations for TI with 3 and 4 shares.
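To make the cross-products $a_{i}\cdot b_{j}$ concrete, the following is a minimal Python sketch of the classic ISW multiplication gadget applied bitwise to 32-bit words, as one would in a bitsliced implementation with $d+1$ Boolean shares. It is not the paper's common-shares optimization and not its ARM code. The function names (`share`, `unshare`, `isw_and`) are illustrative assumptions, and Python's `secrets` module stands in for whatever randomness source a real target would use. The sketch only demonstrates functional correctness; Python offers no side-channel guarantees.

```python
import secrets


def share(x, d):
    """Split a 32-bit word x into d+1 Boolean shares whose XOR equals x."""
    shares = [secrets.randbits(32) for _ in range(d)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]


def unshare(shares):
    """Recombine Boolean shares by XOR-ing them together."""
    acc = 0
    for s in shares:
        acc ^= s
    return acc


def isw_and(a, b):
    """ISW-style masked AND of two shared 32-bit words (bitwise, 32 lanes)."""
    n = len(a)
    # Diagonal terms a_i & b_i need no fresh randomness.
    c = [a[i] & b[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(32)  # one fresh 32-bit mask per pair i < j
            # The cross-products a_i & b_j and a_j & b_i are folded in
            # behind the fresh mask r; the bracketing follows the ISW order.
            r_ji = (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
            c[i] ^= r
            c[j] ^= r_ji
    return c


if __name__ == "__main__":
    d = 2  # masking order (assumed for illustration)
    x, y = 0xDEADBEEF, 0x0F0F0F0F
    z = isw_and(share(x, d), share(y, d))
    print(hex(unshare(z)), hex(x & y))
```

In this plain form each masked AND draws $d(d+1)/2$ fresh 32-bit words and computes all $(d+1)^2$ products $a_i \cdot b_j$. That per-gadget cost is what sharing cross-products across parallel gadgets, as the abstract describes, is meant to reduce.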


Source journal

Computer Journal (Engineering & Technology; Computer Science: Software Engineering)

CiteScore: 3.60
Self-citation rate: 7.10%
Articles published: 164
Review time: 4.8 months

About the journal: The Computer Journal is one of the longest-established journals serving all branches of the academic computer science community. It is currently published in four sections.