Pub Date: 2023-04-05; DOI: 10.1109/ISQED57927.2023.10129383
Hyunwoo Kim, Hyundong Lee, Jongbeom Kim, Yunjeong Go, Seungwon Baek, Jae-Seok Song, Junhyeon Kim, Minyoung Jung, Hyodong Kim, Seong-Jae Kim, Taigon Song
The vast amount of data used for artificial intelligence causes a bottleneck between the processor and memory. To tackle this issue, processing-in-memory (PIM), a technology that embeds a processing unit in the memory, has been proposed. However, SRAM- and DRAM-based PIM suffers from limited capacity. Thus, we propose a NAND flash PIM scheme that shares the cache register. Our scheme reduces the read latency and operation time by 22.8% and 43.7%, respectively, compared to the conventional memory system. The power-performance-area (PPA) metric was reduced by 17.2% by reducing the number of cycles. Our NAND PIM specializes in tasks requiring high-performance computing.
Title: Cache Register Sharing Structure for Channel-level Near-memory Processing in NAND Flash Memory
Venue: 2023 24th International Symposium on Quality Electronic Design (ISQED)
Pub Date: 2023-04-05; DOI: 10.1109/ISQED57927.2023.10129319
Ankit Shukla, L. Heller, Md Golam Morshed, L. Rehm, Avik W. Ghosh, A. Kent, S. Rakheja
Magnetic tunnel junctions (MTJs), which are the fundamental building blocks of spintronic devices, have been used to build true random number generators (TRNGs) with different trade-offs between throughput, power, and area requirements. MTJs with high-barrier magnets (HBMs) have been used to generate random bitstreams with ≲ 200 Mb/s throughput and pJ/bit energy consumption. A high temperature sensitivity, however, adversely affects their performance as a TRNG. Superparamagnetic MTJs employing low-barrier magnets (LBMs) have also been used for TRNG operation. Although LBM-based MTJs can operate at low energy, they suffer from slow dynamics, sensitivity to process variations, and low fabrication yield. In this paper, we model a TRNG based on medium-barrier magnets (MBMs) with perpendicular magnetic anisotropy. The proposed MBM-based TRNG is driven with short voltage pulses to induce ballistic, yet stochastic, magnetization switching. We show that the proposed TRNG can operate at frequencies of about 500 MHz while consuming less than 100 fJ/bit of energy. In the short-pulse ballistic limit, the switching probability of our device shows robustness to variations in temperature and material parameters relative to LBMs and HBMs. Our results suggest that MBM-based MTJs are suitable candidates for building fast and energy-efficient TRNG hardware units for probabilistic computing.
Title: A True Random Number Generator for Probabilistic Computing using Stochastic Magnetic Actuated Random Transducer Devices
Pub Date: 2023-04-05; DOI: 10.1109/ISQED57927.2023.10129286
Qazi Arbab Ahmed, M. Awais, M. Platzner
Automated frameworks for approximate accelerator synthesis employ an iterative search-based approach to generate approximate instances of hardware. While offering distinct savings in hardware area and power consumption, approximate circuits are at risk of being infected with hardware Trojans, mainly because the approximation is typically provided by third-party approximate accelerator synthesis frameworks that use component libraries to perform substitutions during the design space exploration phase. In this paper, we propose a threat model that discusses the potential for hardware Trojan insertion during approximate accelerator synthesis. Moreover, we present MAAS, a framework that exploits a search-based approximate accelerator synthesis technique to demonstrate the applicability of our threat model by hiding Trojans in approximate circuits. The experimental results show that the Trojan-infected approximate circuits generated by MAAS are only slightly larger than the corresponding approximate designs and are hard to identify via conventional area and power measurement techniques. To the best of our knowledge, this is the first effort to demonstrate hardware Trojan insertion in the third-party approximate accelerator synthesis flow via library component substitution.
Title: MAAS: Hiding Trojans in Approximate Circuits
Pub Date: 2023-04-05; DOI: 10.1109/ISQED57927.2023.10129310
Yueqin Dai, Yifeng Song, Jing Tian, Zhongfeng Wang
SPHINCS+, a hash-based signature scheme, has stood out as one of the four winners in the post-quantum cryptography (PQC) competition hosted by the U.S. National Institute of Standards and Technology (NIST). However, its slow signing speed forms a bottleneck for applications. Therefore, Haraka, a short-input hash function, is recommended as the third instantiation in SPHINCS+ due to its advantage in processing speed. In this work, we propose four hardware architecture schemes for Haraka in SPHINCS+, denoted as Case I to Case IV. Several optimization methods are combined and applied in different cases to trade off area against throughput for different application scenarios. We code our designs in SystemVerilog and synthesize them in TSMC 28-nm CMOS technology. The experimental results show that Case IV achieves the best throughput and the best efficiency, about 81.92 Gbps and 1.26 Mbps/GE, respectively, and significantly outperforms both the state-of-the-art implementation of Haraka and the advanced hardware implementation of the SHA-3 hash function.
Title: High-Throughput Hardware Implementation for Haraka in SPHINCS+
Pub Date: 2023-04-05; DOI: 10.1109/ISQED57927.2023.10129295
P. Paul, Maisha Sadia, Anur Dhungel, Parker Hardy, Md. Sakib Hasan
This paper presents a novel one-dimensional discrete-time chaotic map. Significantly improved chaotic behavior, compared to previously published one-dimensional maps, is achieved in the proposed design by virtue of this nonlinear map's stiffer transfer characteristics. The novelty of the work comes from the proposed methodology of splitting the upward- and downward-sloping mechanisms to gain a stiffer slope in the unimodal nonlinear circuit. The design methodology is presented with the help of a stability analysis of fixed points, which is generally applicable to a wide variety of nonlinear circuits. The chaotic complexity of the proposed circuit is analyzed with the bifurcation plot, correlation coefficient, and Lyapunov exponent. The results are compared with reported works to demonstrate a significant improvement. Along with high chaotic complexity, this split-slope chaotic map provides a wide chaotic range covering 100% of the overall region of operation. The high chaotic complexity across a wide chaotic range is achieved with a remarkably low transistor-count circuit that is suitable for many hardware-security applications in resource-constrained devices, including random number generation and chaotic logic circuits.
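The abstract names the Lyapunov exponent as one of its complexity measures. As a rough illustration of how that quantity is typically estimated for a one-dimensional discrete-time map, the sketch below averages ln|f'(x)| along an orbit. The split-slope map itself is not specified in this abstract, so the logistic map is used here purely as a stand-in; the function names, the parameter values 3.9 and 3.2, and the iteration counts are all assumptions chosen for illustration.

```python
import math

def logistic(x, r):
    """Stand-in one-dimensional map (NOT the split-slope map from the paper)."""
    return r * x * (1.0 - x)

def logistic_slope(x, r):
    """Derivative of the stand-in map with respect to x."""
    return r * (1.0 - 2.0 * x)

def lyapunov_exponent(r, x0=0.3, burn_in=1_000, steps=50_000):
    """Estimate the Lyapunov exponent as the orbit average of ln|f'(x)|."""
    x = x0
    for _ in range(burn_in):              # discard the initial transient
        x = logistic(x, r)
    total = 0.0
    for _ in range(steps):
        slope = abs(logistic_slope(x, r))
        total += math.log(max(slope, 1e-300))   # guard against log(0)
        x = logistic(x, r)
    return total / steps

if __name__ == "__main__":
    # A positive estimate indicates chaos; a negative one indicates a stable orbit.
    for r in (3.2, 3.9):
        print(f"r = {r}: estimated exponent = {lyapunov_exponent(r):.3f}")
```

A steeper map slope raises |f'(x)| along the orbit and therefore the estimate, which is consistent with the abstract's emphasis on gaining a stiffer slope.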
Title: Split-Slope Chaotic Map Providing High Entropy Across Wide Range
Pub Date: 2023-04-05; DOI: 10.1109/ISQED57927.2023.10129342
Daniel Xing, Michael Zuzak, A. Srivastava
Logic locking techniques have been proposed to protect chip designs from malicious reverse engineering and overproduction. Stripped functionality logic locking (SFLL) has gained substantial traction as a current state-of-the-art method, exhibiting strong resilience against a wide variety of attacks. However, secure instances of SFLL-based locking tend to have high power and area overheads, particularly in their restore units. This work presents a novel architectural approach to restore unit configuration for SFLL-like logic locking methods that treats restore units as an overhead-constrained shareable resource. We describe, from a graph-theoretic perspective, how resource contention caused by sharing restore units imposes constraints on the underlying locking scheme, and we propose both a 0-1 ILP and a heuristic clustering algorithm for finding resource-constrained shared locking configurations that satisfy these constraints. We evaluate our sharing method on SFLL-flex and find that our ILP and heuristic methods achieve 55% and 31% reductions, respectively, in power used by locked datapaths synthesized from MediaBench benchmarks while maintaining the same security and functionality as datapaths locked with conventional gate-level techniques.
Title: Low Overhead System-Level Obfuscation through Hardware Resource Sharing
Pub Date: 2023-04-05; DOI: 10.1109/isqed57927.2023.10129367
Yuqin Dou, Chongyan Gu, Chenghua Wang, Weiqiang Liu
Approximate computing is a promising computing paradigm that trades off power consumption and performance for error-tolerant applications. It has been widely studied, for example in arithmetic circuits and accelerators. However, recent research has shown security vulnerabilities in approximate circuits. Hardware Trojans are one of the major threats to hardware circuits and have not been fully studied for approximate computing. Majority voting (MV) based on vendor diversity has been proposed as an effective technique to mask and/or detect hardware Trojans when assembling trusted chips from untrustworthy components. However, the randomness and diversity of inherent errors in approximate circuits can invalidate the MV technique. In this paper, for the first time, the authors present the challenges that arise when multiple vendors provide IPs for approximate circuits (IPac). Experiments demonstrate that the MV strategy is not applicable when trusted chips are assembled with IPacs. A comparison-based technique is proposed to thwart hardware Trojan attacks on approximate circuits. The experimental results show that the proposed method is highly effective at detecting hardware Trojans in approximate circuits.
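To make the argument about majority voting concrete, the following toy sketch shows how three Trojan-free approximate adders can already disagree with one another, leaving a voter with no majority. It is not the paper's experiment: the simplified lower-part OR adder (with no carry into the upper part), the three hypothetical vendors that approximate 1, 2, or 3 low bits, and the chosen operands are all assumptions for illustration. The paper's comparison-based technique is not described in the abstract and is not sketched here.

```python
from collections import Counter

def loa_add(a, b, k):
    """Simplified lower-part OR adder: exact add on the high bits,
    bitwise OR on the k low bits, no carry between the two parts."""
    mask = (1 << k) - 1
    low = (a | b) & mask
    high = ((a >> k) + (b >> k)) << k
    return high | low

def majority(outputs):
    """Return the value produced by more than half of the units, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count * 2 > len(outputs) else None

if __name__ == "__main__":
    vendors = (1, 2, 3)      # hypothetical vendors: number of approximated low bits
    a, b = 6, 6

    exact_outputs = [a + b for _ in vendors]
    approx_outputs = [loa_add(a, b, k) for k in vendors]

    print("exact units :", exact_outputs, "->", majority(exact_outputs))
    print("approx units:", approx_outputs, "->", majority(approx_outputs))

    # Count 4-bit operand pairs for which the Trojan-free approximate units
    # fail to produce any majority value.
    undecided = sum(
        1
        for x in range(16)
        for y in range(16)
        if majority([loa_add(x, y, k) for k in vendors]) is None
    )
    print("operand pairs without a majority:", undecided, "of 256")
```

With exact units, any disagreement points to a fault or a Trojan. With approximate units, disagreement is part of normal operation, so a voter cannot tell inherent error from malicious behavior.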
Title: A Novel Method Against Hardware Trojans in Approximate Circuits
Pub Date: 2023-04-05; DOI: 10.1109/ISQED57927.2023.10129324
Richard Yarnell, M. Hossain, R. Demara
Until recently, FPGA-based acceleration of convolutional neural networks (CNNs) has remained an open-ended research problem. Herein, we evaluate one new method for rapidly implementing CNNs using industry-standard frameworks within Xilinx UltraScale+ FPGA devices. Within this workflow, referred to as Framework for Accelerating YOLO-Based ML on Edge-devices (FAYME), a TensorFlow model of the You Only Look Once version 4 (YOLOv4) object detection algorithm is realized using Xilinx’s Vitis AI toolchain. We test various levels of model bit-quantization and evaluate performance while simultaneously analyzing the utilization of available memory and processing elements. We also implement a ResNet-50 model to provide additional comparisons. In this paper, we present our YOLO model, which achieves a mAP of 0.581, and our ResNet model, which achieves a Top-5 accuracy of 0.950. Furthermore, we demonstrate that these results are possible while utilizing less than 25% of the throughput offered by a single hardware accelerator in an UltraScale+ FPGA.
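The trade-off behind "various levels of model bit-quantization" can be illustrated with a generic symmetric uniform quantizer: fewer bits mean a coarser grid and a larger rounding error. This is only a minimal sketch of the general idea and makes no claim about how the Vitis AI toolchain quantizes models; the function names and the example weights are made up for illustration.

```python
def quantize(values, bits):
    """Symmetric uniform quantization: snap floats to a signed `bits`-bit grid
    and map them back to floats (so the rounding error can be inspected)."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    qmax = (1 << (bits - 1)) - 1                 # e.g. 127 for 8 bits
    scale = (max(abs(v) for v in values) / qmax) or 1.0   # avoid a zero scale
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute difference between the original and quantized values."""
    approx = quantize(values, bits)
    return sum(abs(v - q) for v, q in zip(values, approx)) / len(values)

if __name__ == "__main__":
    weights = [0.9, -0.45, 0.1, 0.33, -0.72]     # made-up example weights
    for bits in (8, 4, 2):
        print(f"{bits}-bit mean absolute error: {mean_abs_error(weights, bits):.4f}")
```

In an accelerator, the narrower grid reduces memory and arithmetic cost, which is the resource side of the trade-off the abstract evaluates against accuracy.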
Title: Image Quantization Tradeoffs in a YOLO-based FPGA Accelerator Framework
Pub Date: 2023-04-05; DOI: 10.1109/ISQED57927.2023.10129372
O. Paul, Sakib Abrar, Richard Mu, Riadul Islam, Manar D. Samad
Surface acoustic wave (SAW) sensors with increasingly unique and refined design patterns are often developed using lithographic fabrication processes. Emerging applications of SAW sensors often require novel materials, which may present uncharted fabrication outcomes. The fidelity of SAW sensor performance is often correlated with the ability to restrict the presence of post-fabrication defects. Therefore, it is critical to have effective means of detecting defects within the SAW sensor. However, labor-intensive manual labeling is often required because surface features must be identified and classified precisely for increased confidence in model accuracy. One approach to automating defect detection is to leverage effective machine learning techniques to analyze and quantify defects within the SAW sensor. In this paper, we propose a machine learning approach using a deep convolutional autoencoder to segment surface features semantically. The proposed deep image autoencoder takes a grayscale input image and generates a color image segmenting the defect region in red, metallic interdigital transducing (IDT) fingers in green, and the substrate region in blue. Experimental results demonstrate promising segmentation scores in locating the defects and regions of interest for a novel SAW sensor variant. The proposed method can automate the process of localizing and measuring post-fabrication defects at the pixel level that may be missed by error-prone visual inspection.
Title: Deep Image Segmentation for Defect Detection in Photo-lithography Fabrication
Pub Date: 2023-04-05; DOI: 10.1109/ISQED57927.2023.10129332
Tinaqi Zhang, Sahand Salamat, Behnam Khaleghi, Justin Morris, Baris Aksanli, T. Simunic
Building a highly efficient FPGA accelerator for Hyperdimensional (HD) computing is tedious work that requires Register Transfer Level (RTL) programming and verification. An inexperienced designer might waste significant time finding the best resource allocation scheme to achieve the target performance under resource constraints, especially for edge applications. HD computing is a novel computational paradigm that emulates brain functionality in performing cognitive tasks. The underlying computations of HD involve a substantial number of element-wise operations (e.g., additions and multiplications) on ultra-wide hypervectors (HVs), which can be effectively parallelized and pipelined. Although different HD applications might vary in the number of input features and output classes (labels), they generally follow the same computation flow. In this paper, we propose HD2FPGA, an automated tool that generates fast and highly efficient FPGA-based accelerators for HD classification and clustering. HD2FPGA eliminates the arduous task of hand-crafting hardware accelerators by leveraging a template of optimized processing elements to automatically generate an FPGA implementation as a function of application specifications and user constraints. For HD classification, HD2FPGA provides, on average, a 1.5× (up to 2.5×) speedup compared to the state-of-the-art FPGA-based accelerator, and a 36.6× speedup with 5.4× higher energy efficiency compared to the GPU-based one. For HD clustering, HD2FPGA is 2.2× faster than the GPU framework.
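The common computation flow mentioned above (element-wise multiplications and additions on wide hypervectors) can be sketched in a few lines of software. The following is a minimal HD classifier under stated assumptions, not the HD2FPGA design: the dimensionality of 2000, the bipolar encoding, the independent random level hypervectors (practical encoders usually use correlated levels), and the class and method names are all illustrative choices.

```python
import random

D = 2000  # hypervector width (illustrative; practical designs are often wider)

def random_hv(rng):
    """Random bipolar hypervector with elements in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(D)]

def bind(a, b):
    """Binding: element-wise multiplication."""
    return [x * y for x, y in zip(a, b)]

def bundle(hvs):
    """Bundling: element-wise addition."""
    return [sum(col) for col in zip(*hvs)]

def dot(a, b):
    """Similarity measure: dot product."""
    return sum(x * y for x, y in zip(a, b))

class HDClassifier:
    def __init__(self, n_features, n_levels, seed=0):
        rng = random.Random(seed)
        self.ids = [random_hv(rng) for _ in range(n_features)]    # one per feature
        self.levels = [random_hv(rng) for _ in range(n_levels)]   # one per quantized value
        self.classes = {}                                         # label -> class hypervector

    def encode(self, sample):
        """Bind each feature ID with its level hypervector, then bundle."""
        return bundle([bind(self.ids[i], self.levels[v]) for i, v in enumerate(sample)])

    def train(self, sample, label):
        """Accumulate the encoded sample into its class hypervector."""
        enc = self.encode(sample)
        prev = self.classes.get(label)
        self.classes[label] = enc if prev is None else bundle([prev, enc])

    def predict(self, sample):
        """Return the label whose class hypervector is most similar to the sample."""
        enc = self.encode(sample)
        return max(self.classes, key=lambda c: dot(self.classes[c], enc))

if __name__ == "__main__":
    clf = HDClassifier(n_features=4, n_levels=4, seed=1)
    for sample, label in [([0, 0, 0, 0], "a"), ([0, 0, 0, 1], "a"),
                          ([3, 3, 3, 3], "b"), ([3, 3, 2, 3], "b")]:
        clf.train(sample, label)
    print(clf.predict([0, 0, 1, 0]), clf.predict([3, 3, 3, 2]))
```

Every loop here runs independently over the D elements, which is why the abstract describes these operations as easy to parallelize and pipeline in hardware.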
Title: HD2FPGA: Automated Framework for Accelerating Hyperdimensional Computing on FPGAs