Pub Date: 2024-12-04, DOI: 10.1109/TVLSI.2024.3505920
Anawin Opasatian;Makoto Ikeda
Modular multiplication is a fundamental operation in many cryptographic systems, and its efficiency plays a crucial role in their overall performance. Since many cryptographic systems operate with a fixed modulus, we propose an enhancement to the fixed-modulus lookup table (LuT) method used for modular reduction, which we refer to as the manipulated LuT (MLuT) method. Our approach applies to any modulus and achieves performance comparable to that of some specialized reduction algorithms designed for specific moduli. We demonstrate the circuit-level strength of the proposed method by implementing it on Virtex-7 and Virtex UltraScale+ FPGAs as the LUT-based MLuT modular multiplier (LUT-MLuTMM), with generalized parallel counters (GPCs) used in the summation step. In one-stage implementations, the proposed method achieves up to a 90% reduction in area and a 50% reduction in latency compared with the generic LuT method. In multistage implementations, our approach offers the best area-interleaved time product, with improvements of 39%, 13%, and 29% over the current state of the art for ~256-bit, SIKE434, and BLS12-381 modular multipliers, respectively. These results demonstrate the potential of our method for high-performance cryptographic accelerators employing a fixed modulus.
Title: Manipulated Lookup Table Method for Efficient High-Performance Modular Multiplier. Authors: Anawin Opasatian; Makoto Ikeda. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 1, pp. 114-127. Publication date: 2024-12-04. URL: https://doi.org/10.1109/TVLSI.2024.3505920
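For readers unfamiliar with the generic fixed-modulus lookup table (LuT) reduction named in the abstract above, the basic idea can be sketched in software as follows. This is only an illustration of the well-known baseline technique, not the authors' MLuT method or their FPGA design; the function names `build_table` and `lut_reduce` and the choice of modulus are assumptions made for the example.

```python
def build_table(p, in_bits):
    """Precompute 2**i mod p for every input bit position at or above p's width."""
    k = p.bit_length()
    return [pow(2, i, p) for i in range(k, in_bits)]


def lut_reduce(x, p, table):
    """Reduce x modulo the fixed modulus p using the precomputed table."""
    k = p.bit_length()
    assert 0 <= x < (1 << (k + len(table))), "input wider than the table covers"
    acc = x & ((1 << k) - 1)      # low k bits pass through unchanged
    high = x >> k                 # each set high bit selects one table entry
    for i, entry in enumerate(table):
        if (high >> i) & 1:
            acc += entry          # summation step (a compressor tree in hardware)
    # acc is congruent to x mod p but can still exceed p, so correct it
    # with repeated subtraction (bounded by the number of table entries + 1).
    while acc >= p:
        acc -= p
    return acc


# Hypothetical usage: a 255-bit prime modulus and a double-width product.
P = 2**255 - 19
TABLE = build_table(P, 2 * P.bit_length())
a = 0x1234567890ABCDEF << 180
b = 0x0FEDCBA987654321 << 170
assert lut_reduce(a * b, P, TABLE) == (a * b) % P
```

Because the modulus is fixed, the table is computed once; the per-multiplication work is then a selection of table entries, a multi-operand sum, and a small final correction.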
Pub Date: 2024-12-04, DOI: 10.1109/TVLSI.2024.3507714
Jhe-En Lin;Shen-Iuan Liu
This article presents a 40-Gb/s (25.6-GBaud) three-level pulse amplitude modulation (PAM-3) baud-rate receiver with a one-tap decision-feedback equalizer (DFE). A baud-rate phase detector (BRPD) that locks at the point where the first postcursor is zero is proposed. In addition, by reusing the BRPD’s error samplers, a weighting coefficient calibration is presented that selects the DFE weighting coefficient maximizing the top level of the eye diagram, thereby improving eye height across different channel losses. An inductorless continuous-time linear equalizer (CTLE) and a variable gain amplifier (VGA) are also included. The VGA adjusts the output common-mode resistance to control data swing, reducing power consumption when the required swing is small. Furthermore, by using modified summer-merged slicers, the capacitance presented by the slicers to the VGA is reduced. Finally, a digital clock/data recovery (CDR) circuit is presented, which includes a demultiplexer (DeMUX) with a short delay time to reduce the loop latency. The 40-Gb/s PAM-3 receiver is fabricated in 28-nm CMOS technology. For a 25.6-GBaud pseudorandom ternary sequence of $3^{7}$