Pub Date : 2023-09-25 DOI: 10.1109/LES.2023.3298741
Bo Zhang;Caixu Zhao;Xi Li
In real-time systems, it is essential to verify the end-to-end constraints that regulate the external input/output (I/O) semantics of the head and tail tasks in each effect chain during the design phase and preserve them during implementation. The logical execution time (LET) model has been adopted by the industry due to the predictability and composability of its timed behavior. However, during the execution of LET-based effect chains, there are ineffective jobs whose outputs are redundant or unused and do not contribute to the external I/O behavior. This letter proposes an offline optimization method for deriving multiframe tasks that achieve the external timed I/O semantics of the LET-based effect chains with reduced utilization. The method first removes ineffective jobs from each effect chain and further explores the benefits of removing jobs for single and crossing effect chains by loosening the LET interval. The method is evaluated using synthetic benchmarks that mimic real-world automotive applications.
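As a rough illustration of the ineffective-job notion (a simplified two-task chain with hypothetical periods, not the letter's optimization method), the sketch below simulates LET data propagation: the producer publishes at the end of each of its periods, the consumer samples at the start of each of its own, and a producer job whose output is overwritten before any consumer read is ineffective and could be removed.

```python
# Minimal sketch: counting ineffective producer jobs in a two-task LET chain.
# Assumptions: consumer reads at period start, producer publishes at period
# end, and a read coinciding with a publish sees the newly published value.
from math import lcm

P_PROD, P_CONS = 5, 12        # hypothetical task periods
HYPER = lcm(P_PROD, P_CONS)   # hyperperiod

def effective_producer_jobs(p_prod, p_cons, hyper):
    """Return the set of producer jobs whose published output is read."""
    effective = set()
    for j in range(hyper // p_cons):
        t = j * p_cons            # consumer read instant
        m = t // p_prod           # publishes that have occurred by time t
        if m > 0:
            effective.add(m - 1)  # job k publishes at (k + 1) * p_prod
    return effective

total = HYPER // P_PROD
eff = effective_producer_jobs(P_PROD, P_CONS, HYPER)
print(f"{total - len(eff)} of {total} producer jobs per hyperperiod are ineffective")
```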
{"title":"External Timed I/O Semantics Preserving Utilization Optimization for LET-Based Effect Chain","authors":"Bo Zhang;Caixu Zhao;Xi Li","doi":"10.1109/LES.2023.3298741","DOIUrl":"10.1109/LES.2023.3298741","url":null,"abstract":"In real-time systems, it is essential to verify the end-to-end constraints that regulate the external input/output (I/O) semantics of the head and tail tasks in each effect chain during the design phase and preserve them during implementation. The logical execution time (LET) model has been adopted by the industry due to the predictability and composability of its timed behavior. However, during the execution of LET-based effect chains, there are ineffective jobs whose outputs are redundant or unused and do not contribute to the external I/O behavior. This letter proposes an offline optimization method for deriving multiframe tasks that achieve the external timed I/O semantics of the LET-based effect chains with reduced utilization. The method first removes ineffective jobs from each effect chain and further explores the benefits of removing jobs for single and crossing effect chains by loosening the LET interval. The method is evaluated using synthetic benchmarks that mimic real-world automotive applications.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"15 4","pages":"198-201"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135699686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-09-25 DOI: 10.1109/LES.2023.3298737
Anandpreet Kaur;Pravin Srivastav;Bibhas Ghoshal
In this article, we introduce Flip-On-Chip, the first end-to-end tool that thoroughly examines the vulnerability of embedded DRAM to Rowhammer bit flips. Flip-On-Chip uses DRAM address mapping information to perform a double-sided Rowhammer test efficiently and deterministically. We evaluated Flip-On-Chip on two DRAM modules, LPDDR2 and LPDDR4, and found that it increases the number of bit flips by 7.34% on LPDDR2 and by 99.97% on LPDDR4 compared to state-of-the-art approaches in the literature. Additionally, Flip-On-Chip accounts for a number of system-level parameters and evaluates their influence on triggering Rowhammer bit flips.
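For illustration only, the sketch below shows the address-mapping step that a double-sided Rowhammer test depends on: translating a victim row index into the physical addresses of its two physically adjacent aggressor rows. The bit layout is a placeholder assumption; real mappings are vendor-specific, and the letter's tool derives them per module.

```python
# Hypothetical linear row mapping: row bits start at bit 13 of the physical
# address, 15 row bits in total. Real DRAM mappings must be reverse-engineered.
ROW_SHIFT = 13
ROW_MASK = 0x7FFF

def row_to_phys(row: int) -> int:
    return (row & ROW_MASK) << ROW_SHIFT

def double_sided_pair(victim_row: int):
    # aggressors are the rows physically adjacent to the victim on both sides
    return row_to_phys(victim_row - 1), row_to_phys(victim_row + 1)

agg_lo, agg_hi = double_sided_pair(victim_row=1000)
print(hex(agg_lo), hex(agg_hi))
# An actual test would then issue many alternating uncached reads to these two
# addresses (flushing caches in between) before scanning the victim row for flips.
```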
{"title":"Flipping Bits Like a Pro: Precise Rowhammering on Embedded Devices","authors":"Anandpreet Kaur;Pravin Srivastav;Bibhas Ghoshal","doi":"10.1109/LES.2023.3298737","DOIUrl":"10.1109/LES.2023.3298737","url":null,"abstract":"In this article, we introduce Flip-On-Chip, the first end-to-end tool that thoroughly examines the vulnerability of embedded DRAM against rowhammer bit flips. Our tool, Flip-On-Chip, utilizes DRAM address mapping information to efficiently and deterministically perform a double-sided RowHammer test. We evaluated Flip-On-Chip on two DRAM modules: 1) LPDDR2 and 2) LPDDR4. It is found that our proposed tool increases the number of bit flips by 7.34 % on LPDDR2 and by 99.97 % on LPDDR4, as compared to state-of-the-art approaches provided in the literature. Additionally, Flip-On-Chip takes into account a number of system-level parameters to evaluate their influence on triggering Rowhammer bit flips.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"15 4","pages":"218-221"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135702640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-09-25 DOI: 10.1109/LES.2023.3299638
Nikolay Penkov;Konstantinos Balaskas;Martin Rapp;Joerg Henkel
Transformer models continuously achieve state-of-the-art performance on a wide range of benchmarks. To meet demanding performance targets, the number of model parameters is continuously increased. As a result, state-of-the-art Transformers require substantial computational resources, prohibiting their deployment on consumer-grade hardware. In the literature, overparameterized Transformers are successfully reduced in size with the help of pruning strategies, but existing works cannot optimize the full architecture in a fully differentiable manner without incurring significant overheads. Our work proposes a single-stage approach for training a Transformer for memory-efficient inference across various resource-constrained scenarios. Transformer blocks are extended with trainable gate parameters, which attribute importance and control information flow. Their integration into a differentiable pruning-aware training scheme allows the extraction of extremely sparse subnetworks at runtime with minimal performance degradation. Pruning results at the attention-head and layer levels illustrate the memory efficiency of our trained subnetworks under various memory budgets.
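A minimal PyTorch sketch of the gating idea, with assumed tensor shapes and illustrative names (not the authors' code): every attention head is scaled by a trainable sigmoid gate so importance is learned differentiably, and thresholding the learned gates extracts the sparse subnetwork.

```python
# Sketch: trainable per-head gates controlling information flow.
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    def __init__(self, n_heads: int):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))  # one gate per head

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, n_heads, seq_len, d_head)
        g = torch.sigmoid(self.gate_logits)          # soft gates in (0, 1)
        return head_outputs * g.view(1, -1, 1, 1)    # scale each head's output

    def keep_mask(self, thresh: float = 0.05) -> torch.Tensor:
        # hard mask used when extracting the sparse subnetwork
        return torch.sigmoid(self.gate_logits) > thresh

# During training, a sparsity penalty on the gates would be added to the task
# loss, e.g.: loss = task_loss + lam * torch.sigmoid(m.gate_logits).sum()
```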
{"title":"Differentiable Slimming for Memory-Efficient Transformers","authors":"Nikolay Penkov;Konstantinos Balaskas;Martin Rapp;Joerg Henkel","doi":"10.1109/LES.2023.3299638","DOIUrl":"10.1109/LES.2023.3299638","url":null,"abstract":"Transformer models are continuously achieving state-of-the-art performance on a wide range of benchmarks. To meet demanding performance targets, the number of model parameters is continuously increased. As a result, state-of-the-art Transformers require substantial computational resources prohibiting their deployment on consumer-grade hardware. In the literature, overparameterized Transformers are successfully reduced in size with the help of pruning strategies. Existing works lack the ability to optimize the full architecture, without incurring significant overheads, in a fully differentiable manner. Our work proposes a single-stage approach for training a Transformer for memory-efficient inference and various resource-constrained scenarios. Transformer blocks are extended with trainable gate parameters, which attribute importance and control information flow. Their integration into a differentiable pruning-aware training scheme allows the extraction of extremely sparse subnetworks at runtime, with minimal performance degradation. Evaluative pruning results, at the attention head and layer levels, illustrate the memory efficiency of our trained subnetworks under various memory budgets.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"15 4","pages":"186-189"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135699987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-09-25 DOI: 10.1109/LES.2023.3298736
Yuhao Liu;Shubham Rai;Salim Ullah;Akash Kumar
Recent research has widely explored quantization schemes on hardware. However, for accelerators that support only 8-bit quantization, such as the Google TPU, lower-precision inputs, such as the 1-/2-bit quantized neural network models produced by FINN, must be widened to meet the hardware interface requirements, which hurts communication and computing efficiency. To improve the flexibility and throughput of quantized multipliers, this letter explores two novel runtime-reconfigurable multi-precision multiplier designs, based on multiplier-tree and bit-serial architectures, that can repartition the number of input channels at runtime based on input precision and switch between signed and unsigned multiplication modes. We evaluated our designs by implementing a systolic array and a single-layer neural network accelerator on the Ultra96 FPGA platform. The results show the flexibility of our implementation and a high speedup for low-precision quantized multiplication under a fixed hardware interface data width.
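To make the repartitioning idea concrete, here is a behavioral sketch (software emulation with an assumed 8-bit interface, not the proposed RTL): the fixed-width input word is split at runtime into narrower unsigned lanes according to the selected precision, so four 2-bit operands share the interface of one 8-bit operand.

```python
# Sketch: runtime repartitioning of a fixed 8-bit input word into lanes.
def split_lanes(word: int, prec: int, width: int = 8):
    """Split a width-bit word into width // prec unsigned lanes of prec bits."""
    mask = (1 << prec) - 1
    return [(word >> (i * prec)) & mask for i in range(width // prec)]

def lane_multiply(word_a: int, word_b: int, prec: int):
    # one packed multiplication per lane instead of a single 8x8 multiply
    return [a * b for a, b in zip(split_lanes(word_a, prec),
                                  split_lanes(word_b, prec))]

print(lane_multiply(0b11100100, 0b01101100, prec=2))  # four 2-bit products
print(lane_multiply(0b11100100, 0b01101100, prec=8))  # one 8-bit product
```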
{"title":"High-Flexibility Designs of Quantized Runtime Reconfigurable Multi-Precision Multipliers","authors":"Yuhao Liu;Shubham Rai;Salim Ullah;Akash Kumar","doi":"10.1109/LES.2023.3298736","DOIUrl":"10.1109/LES.2023.3298736","url":null,"abstract":"Recent research widely explored the quantization schemes on hardware. However, for recent accelerators only supporting 8 bits quantization, such as Google TPU, the lower-precision inputs, such as 1/2-bit quantized neural network models in FINN, need to extend the data width to meet the hardware interface requirements. This conversion influences communication and computing efficiency. To improve the flexibility and throughput of quantized multipliers, our work explores two novel reconfigurable multiplier designs that can repartition the number of input channels in runtime based on input precision and reconfigure the signed/unsigned multiplication modes. In this letter, we explored two novel runtime reconfigurable multi-precision multipliers based on the multiplier-tree and bit-serial multiplier architectures. We evaluated our designs by implementing a systolic array and single-layer neural network accelerator on the Ultra96 FPGA platform. The result shows the flexibility of our implementation and the high speedup for low-precision quantized multiplication working with a fixed data width of the hardware interface.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"15 4","pages":"194-197"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135699988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-09-25 DOI: 10.1109/LES.2023.3318591
Wei-Ting Hsu;Pei-Yu Lo;Chi-Wei Chen;Chin-Wei Tien;Sy-Yen Kuo
Hardware Trojans (HTs) have become a serious threat to the Internet of Things due to the globalization of the integrated circuit industry. To evade functional verification, HTs tend to have at least one trigger signal in the gate-level netlist with a very low transition probability. Exploiting this property, previous studies use imbalanced controllability as a feature to detect HTs, assuming that signals with imbalanced controllability are always accompanied by low transition probability. However, this study presents a new type of HT that has low transition probability yet balanced controllability, defeating previous methods; current imbalanced-controllability detectors are therefore inadequate in this scenario. To address this limitation, we propose a probability-based detection method that uses unsupervised anomaly analysis to detect HTs. Our proposed method detects not only the proposed HT but also the 580 Trojan benchmarks on Trusthub. Experimental results show that our detector outperforms other detectors, achieving an overall 100% true positive rate and a 0.37% false positive rate on the 580 benchmarks.
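The sketch below illustrates, under standard independence assumptions, why low transition probability is the usual HT trigger signature: propagating signal probabilities through a deep AND tree drives p(1) toward zero, which yields both imbalanced controllability and a tiny per-cycle transition probability 2p(1-p). The letter's new HT keeps controllability balanced while the transition probability stays low, which this naive construction does not capture.

```python
# Sketch: signal probability propagation and transition probability.
def p_and(pa: float, pb: float) -> float:
    return pa * pb                      # independent inputs

def transition_prob(p1: float) -> float:
    return 2.0 * p1 * (1.0 - p1)        # per-cycle switching probability

p = 0.5
for _ in range(8):                      # an 8-level AND tree of random inputs
    p = p_and(p, 0.5)
print(f"p(1) = {p:.4g}, transition probability = {transition_prob(p):.4g}")
# Detectors keying on imbalanced controllability flag this signal; the letter's
# HT achieves a similarly low transition probability with balanced controllability.
```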
{"title":"Hardware Trojan Detection Method Against Balanced Controllability Trigger Design","authors":"Wei-Ting Hsu;Pei-Yu Lo;Chi-Wei Chen;Chin-Wei Tien;Sy-Yen Kuo","doi":"10.1109/LES.2023.3318591","DOIUrl":"10.1109/LES.2023.3318591","url":null,"abstract":"HT has become a serious threat to the Internet of Things due to the globalization of the integrated circuit industry. To evade functional verification, HTs tend to have at least one trigger signal at the gate-level netlist with a very low transition probability. Based on this nature, previous studies use imbalanced controllability as a feature to detect HTs, assuming that signals with imbalanced controllability are always accompanied by low transition probability. However, this study has found out a way to create a new type of HT that has low transition probability but balanced controllability, against previous methods. Hence, current imbalanced controllability detectors are inadequate in this scenario. To address this limitation, we propose a probability-based detection method that uses unsupervised anomaly analysis to detect HTs. Our proposed method detects not only the proposed HT but also the 580 Trojan benchmarks on Trusthub. Experimental results show that our proposed detector outperforms other detectors, achieving an overall 100% true positive rate and 0.37% false positive rate on the 580 benchmarks.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 2","pages":"178-181"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135700341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-09-25 DOI: 10.1109/LES.2023.3299214
Hassan Nassar;Lars Bauer;Jörg Henkel
Physical unclonable functions (PUFs) are a handy security primitive for resource-constrained devices, offering an alternative to resource-intensive classical hash algorithms. Using per-device IC differences resulting from the fabrication process, PUFs produce device-specific outputs (responses) when given the same inputs (challenges); hence, they can generate device-specific responses without storing a device-specific key. FPGAs are one of the platforms most heavily studied as candidates for PUF implementation: a PUF described in HDL can be used as part of the static design or as a dynamic accelerator. Previous works studied PUF implementation as part of the static design. In contrast to the state of the art, this letter studies PUFs used as runtime-reconfigurable accelerators. We find that not all regions of an FPGA are equally suitable for implementing different PUF types: regions containing clock routing resources are the worst suited for PUF implementation. Moreover, we find that for certain PUF types, dynamic partial reconfiguration can degrade performance if not applied carefully; as the static routing passing through a region increases, PUF performance degrades significantly.
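As a sketch of how such region-to-region comparisons are commonly quantified (a standard PUF metric, assumed here rather than taken from the letter), the reliability of a PUF placed in a given region can be scored from the Hamming distance between repeated responses; regions crossed by clock or static routing would show lower scores.

```python
# Sketch: intra-chip reliability of a PUF response across re-evaluations.
def hamming(a: int, b: int, bits: int = 64) -> int:
    return bin((a ^ b) & ((1 << bits) - 1)).count("1")

def reliability(ref: int, repeats: list, bits: int = 64) -> float:
    """100% means the response is perfectly stable across re-evaluations."""
    avg_hd = sum(hamming(ref, r, bits) for r in repeats) / len(repeats)
    return 100.0 * (1.0 - avg_hd / bits)

ref = 0xDEADBEEFCAFEF00D                 # hypothetical 64-bit response
noisy = [ref ^ 0b1, ref ^ 0b101, ref]    # re-reads with a few bit flips
print(f"reliability = {reliability(ref, noisy):.2f}%")
```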
{"title":"Effects of Runtime Reconfiguration on PUFs Implemented as FPGA-Based Accelerators","authors":"Hassan Nassar;Lars Bauer;Jörg Henkel","doi":"10.1109/LES.2023.3299214","DOIUrl":"10.1109/LES.2023.3299214","url":null,"abstract":"Physical unclonable functions (PUFs) are a handy security primitive for resource-constrained devices. They offer an alternative to the resource-intensive classical hash algorithms. Using the IC differences resulting from the fabrication process, PUFs give device-specific outputs (responses) when given the same inputs (challenges). Hence, without using a device-specific key, PUFs can generate device-specific responses. FPGAs are one of the platforms that are heavily studied as a candidate for PUF implementation. The idea is that a PUF that is designed as an HDL code can be used as part of the static design or as a dynamic accelerator. Previous works studied PUF implementation as part of the static design. In contrast to the state-of-the-art, this letter studies PUFs when used as runtime reconfigurable accelerators. In this letter, we find that not all regions of an FPGA are equally suitable for implementing different PUF types. Regions, where clock routing resources exist, are the worst suited for PUF implementation. Moreover, we find out that for certain PUF types, the property of dynamic partial reconfiguration can lead to performance degradation if not applied carefully. When static routing passing through the region increases, the PUF performance degrades significantly.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"15 4","pages":"174-177"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135700348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-09-25 DOI: 10.1109/LES.2023.3298743
Sushree Sila P. Goswami;Gaurav Trivedi
Security plays a vital role in electronic communication, particularly in wireless networks like long term evolution (LTE), where safeguarding data and resources from malicious activities is crucial. Cryptographic algorithms are at the core of security mechanisms, ensuring the protection of sensitive information. While software implementations of these algorithms are relatively straightforward, they often lack the speed required by real-time applications on communication devices such as mobile phones. Consequently, implementing these cryptographic algorithms as hardware crypto processors becomes necessary. This letter presents a novel implementation of the SNOW 3G crypto processor architecture for 4G LTE security applications, focusing on area, power, and efficiency. The proposed modified SNOW 3G architecture utilizes only 0.31% of the available area when implemented on an FPGA Zynq ZC702 and achieves an efficiency of 28.34, defined as the ratio of throughput to area. Furthermore, it consumes a total power of 0.142 mW. These low power and area requirements make the design highly suitable for integration into mobile devices, meeting their specific constraints and enabling efficient cryptographic operations.
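The sketch below shows only the generic LFSR-plus-FSM keystream structure that such a stream cipher implements in hardware; the taps, FSM update, and constants are deliberately toy values and are NOT the SNOW 3G specification (which uses a 16-stage LFSR over GF(2^32), a three-register FSM, and two S-boxes).

```python
# Toy LFSR + FSM keystream generator (structure only, not SNOW 3G).
M32 = 0xFFFFFFFF  # work on 32-bit words

def keystream(lfsr, r1, r2, n):
    out = []
    for _ in range(n):
        f = ((lfsr[15] + r1) & M32) ^ r2                  # FSM output word
        out.append(f ^ lfsr[0])                           # keystream word
        r1, r2 = (r2 + lfsr[5]) & M32, r1 ^ lfsr[10]      # toy FSM update
        fb = lfsr[0] ^ lfsr[2] ^ lfsr[11]                 # toy feedback taps
        lfsr = lfsr[1:] + [fb]                            # shift the register
    return out

print([hex(w) for w in keystream(list(range(1, 17)), 0x1234, 0x5678, 4)])
```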
{"title":"FPGA Implementation of Modified SNOW 3G Stream Ciphers Using Fast and Resource Efficient Substitution Box","authors":"Sushree Sila P. Goswami;Gaurav Trivedi","doi":"10.1109/LES.2023.3298743","DOIUrl":"10.1109/LES.2023.3298743","url":null,"abstract":"Security plays a vital role in electronic communication, particularly in wireless networks like long term evolution (LTE), where safeguarding data and resources from malicious activities is crucial. Cryptographic algorithms are at the core of security mechanisms, ensuring the protection of sensitive information. While software implementations of these algorithms are relatively straightforward, they often need more speed in real-time applications for communication devices like mobile phones. Consequently, implementation of these cryptographic algorithms as hardware crypto processors becomes necessary. This letter presents a novel implementation of the SNOW3G crypto processor architecture for the 4G LTE security applications, focusing on area, power, and efficiency. The proposed modified SNOW3G architecture utilizes only 0.31% of the available area when implemented on an FPGA Zynq ZC702 and achieves 28.34 efficiency which quantifies as the ratio of throughput to the area. Furthermore, it consumes a total power of 0.142 mW. These low power and area requirements make the design highly suitable for integration into mobile devices, meeting their specific constraints and enabling efficient cryptographic operations.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"15 4","pages":"238-241"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135700003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-09-25 DOI: 10.1109/LES.2023.3298900
Dejian Li;Xi Feng;Chongfei Shen;Qi Chen;Lixin Yang;Sihai Qiu;Xin Jin;Meng Liu
This letter introduces a dedicated processor architecture, called MEGACORE, which leverages vector technology to enhance tracking performance in visual simultaneous localization and mapping (VSLAM) systems. By harnessing the inherent parallelism of vector processing and incorporating a floating point unit (FPU), MEGACORE achieves significant acceleration of the VSLAM tracking task. Careful optimizations yield notable improvements over the baseline design: a 14.9% reduction in area and a 4.4% reduction in power consumption. Furthermore, application benchmarks show an average speedup ratio of 3.25 across all stages of the tracking process. These findings highlight the effectiveness of MEGACORE in improving the efficiency and performance of VSLAM systems, making it a promising solution for real-world embedded implementations.
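To show the kind of data parallelism a vector unit exploits in tracking (an assumed NumPy example, not MEGACORE's instruction set), the patch-matching sum of absolute differences at the heart of feature tracking collapses into wide element-wise vector operations:

```python
# Sketch: scalar vs. vectorized sum of absolute differences (SAD).
import numpy as np

def sad_scalar(a, b):
    s = 0
    for i in range(a.shape[0]):          # one element at a time
        for j in range(a.shape[1]):
            s += abs(int(a[i, j]) - int(b[i, j]))
    return s

def sad_vector(a, b):
    # the whole patch difference becomes wide element-wise vector ops
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

rng = np.random.default_rng(0)
p = rng.integers(0, 256, (8, 8))
q = rng.integers(0, 256, (8, 8))
assert sad_scalar(p, q) == sad_vector(p, q)
```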
{"title":"Vector-Based Dedicated Processor Architecture for Efficient Tracking in VSLAM Systems","authors":"Dejian Li;Xi Feng;Chongfei Shen;Qi Chen;Lixin Yang;Sihai Qiu;Xin Jin;Meng Liu","doi":"10.1109/LES.2023.3298900","DOIUrl":"10.1109/LES.2023.3298900","url":null,"abstract":"This letter introduces a dedicated processor architecture, called MEGACORE, which leverages vector technology to enhance tracking performance in visual simultaneous localization and mapping (VSLAM) systems. By harnessing the inherent parallelism of vector processing and incorporating a floating point unit (FPU), MEGACORE achieves significant acceleration in the tracking task of VSLAM. Through careful optimizations, we achieved notable improvements compared to the baseline design. Our optimizations resulted in a 14.9% reduction in the area parameter and a 4.4% reduction in power consumption. Furthermore, by conducting application benchmarks, we determined that the average speedup ratio across all stages of the tracking process is 3.25. These findings highlight the effectiveness of MEGACORE in improving the efficiency and performance of VSLAM systems, making it a promising solution for real-world implementations in embedded systems.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"15 4","pages":"182-185"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135700340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Autonomous vehicles are latency-sensitive systems. The planning phase is a critical component of such systems, during which the in-vehicle compute platform is responsible for determining the future maneuvers that the vehicle will follow. In this letter, we present a GPU-accelerated optimized implementation of the Frenet Path Planner, a widely known path planning algorithm. Unlike the current state of the art, our implementation accelerates the entire algorithm, including the path generation and collision avoidance phases. We measure the execution time of our implementation and demonstrate dramatic speedups compared to the CPU baseline implementation. Additionally, we evaluate the impact of different precision types (double, float, and half) on trajectory errors to investigate the tradeoff between completion latencies and computation precision.
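A brief sketch of why the Frenet Path Planner maps well onto a GPU (illustrative NumPy with assumed boundary conditions, not the letter's CUDA kernels): each candidate trajectory is an independent quintic polynomial in the lateral offset d(t), so generation and collision checking can run one candidate per thread.

```python
# Sketch: generating independent lateral candidates for the Frenet planner.
import numpy as np

def quintic_lateral(d0: float, dT: float, T: float, samples: int = 50):
    """d(t) from d0 to dT with zero lateral velocity/acceleration at both ends."""
    A = np.array([[T**3,    T**4,     T**5],
                  [3*T**2,  4*T**3,   5*T**4],
                  [6*T,     12*T**2,  20*T**3]], dtype=float)
    b = np.array([dT - d0, 0.0, 0.0])
    a3, a4, a5 = np.linalg.solve(A, b)
    t = np.linspace(0.0, T, samples)
    return d0 + a3 * t**3 + a4 * t**4 + a5 * t**5

# Candidates differ only in target offset (and horizon): embarrassingly parallel,
# hence one-candidate-per-GPU-thread generation and collision checking.
candidates = [quintic_lateral(0.0, dT, T=4.0) for dT in np.linspace(-3, 3, 13)]
```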
{"title":"Optimized Local Path Planner Implementation for GPU-Accelerated Embedded Systems","authors":"Filippo Muzzini;Nicola Capodieci;Federico Ramanzin;Paolo Burgio","doi":"10.1109/LES.2023.3298733","DOIUrl":"10.1109/LES.2023.3298733","url":null,"abstract":"Autonomous vehicles are latency-sensitive systems. The planning phase is a critical component of such systems, during which the in-vehicle compute platform is responsible for determining the future maneuvers that the vehicle will follow. In this letter, we present a GPU-accelerated optimized implementation of the Frenet Path Planner, a widely known path planning algorithm. Unlike the current state of the art, our implementation accelerates the entire algorithm, including the path generation and collision avoidance phases. We measure the execution time of our implementation and demonstrate dramatic speedups compared to the CPU baseline implementation. Additionally, we evaluate the impact of different precision types (double, float, and half) on trajectory errors to investigate the tradeoff between completion latencies and computation precision.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"15 4","pages":"214-217"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135700673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stochastic computing (SC) is an emerging paradigm that offers hardware-efficient solutions for developing low-cost and noise-robust architectures. In SC, deterministic logic systems are employed along with bit-stream sources to process scalar values. However, using long bit-streams introduces challenges, such as increased latency and significant energy consumption. To address these issues, we present an optimization-oriented approach for modeling and sizing new logic gates, which results in optimal latency. The optimization process is automated using hardware–software cooperation by integrating Cadence and MATLAB environments. Initially, we optimize the circuit topology by leveraging the design parameters of two-input basic logic gates. This optimization is performed using a multiobjective approach based on a deep neural network. Subsequently, we employ the proposed gates to demonstrate favorable solutions targeting SC-based operations.
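For readers new to SC, the standard textbook construction below (not the letter's optimized gates) shows why bit-stream length dictates latency and energy: a single AND gate multiplies two independent unipolar streams, and accuracy improves only as the streams, and hence the cycle count, grow.

```python
# Sketch: unipolar stochastic multiplication with one AND gate.
import random

def encode(p: float, n: int, rng: random.Random) -> list:
    """Encode a scalar in [0, 1] as the 1-density of an n-bit stream."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(p1: float, p2: float, n: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    x, y = encode(p1, n, rng), encode(p2, n, rng)
    return sum(a & b for a, b in zip(x, y)) / n   # AND gate + pop-count

for n in (64, 1024, 16384):
    print(n, sc_multiply(0.5, 0.4, n))            # converges toward 0.20
```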
{"title":"Hardware–Software Co-Optimization of Long-Latency Stochastic Computing","authors":"Sercan Aygun;Lida Kouhalvandi;M. Hassan Najafi;Serdar Ozoguz;Ece Olcay Gunes","doi":"10.1109/LES.2023.3298734","DOIUrl":"10.1109/LES.2023.3298734","url":null,"abstract":"Stochastic computing (SC) is an emerging paradigm that offers hardware-efficient solutions for developing low-cost and noise-robust architectures. In SC, deterministic logic systems are employed along with bit-stream sources to process scalar values. However, using long bit-streams introduces challenges, such as increased latency and significant energy consumption. To address these issues, we present an optimization-oriented approach for modeling and sizing new logic gates, which results in optimal latency. The optimization process is automated using hardware–software cooperation by integrating Cadence and MATLAB environments. Initially, we optimize the circuit topology by leveraging the design parameters of two-input basic logic gates. This optimization is performed using a multiobjective approach based on a deep neural network. Subsequently, we employ the proposed gates to demonstrate favorable solutions targeting SC-based operations.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"15 4","pages":"190-193"},"PeriodicalIF":1.6,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135700675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}