Pub Date: 2024-07-01 | Epub Date: 2024-05-26 | DOI: 10.1016/j.micpro.2024.105073
Ryan F. Ortiz, Wei-Ming Lin
Simultaneous Multithreading (SMT) allows a processor to execute multiple independent threads concurrently while sharing certain datapath components to reduce resource waste. Speculative execution allows these processors to exploit Instruction-Level Parallelism, but the penalty for a misspeculation includes wasting the shared resources, which wrong-path instructions occupy for clock cycles that are ultimately lost. In this paper we show that, on average, 13% of instructions are flushed as a result of incorrect predictions. These flushed-out instructions may have occupied shared resources that other, non-speculative threads could have used. This paper proposes a technique that dynamically adjusts how many speculative instructions a thread can rename and decode, with the aim of reducing this waste of shared resources. Our simulation results show that, with the proposed technique, the average flushed-out instruction rate is reduced by 23% and average throughput is improved by 13%.
Title: Improving performance of simultaneous multithreading CPUs using autonomous control of speculative traces | Journal: Microprocessors and Microsystems, vol. 108, Article 105073
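The abstract does not specify the control policy, so the following is only a minimal sketch of one plausible feedback loop, not the authors' technique. It assumes a per-thread cap on in-flight speculative instructions that is halved when the thread's flush rate over a sampling interval is high and grown slowly when it is low; the class name `SpecCapController`, the thresholds and the interval-based update are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpecCapController:
    """Hypothetical per-thread limit on speculative instructions allowed through rename/decode."""
    cap: int = 32             # current cap (instructions)
    min_cap: int = 4
    max_cap: int = 64
    high_flush: float = 0.20  # shrink the cap if more than 20% of renamed instructions were flushed
    low_flush: float = 0.05   # grow the cap if fewer than 5% were flushed

    def end_of_interval(self, renamed: int, flushed: int) -> int:
        """Update the cap from counters gathered over one sampling interval."""
        if renamed == 0:
            return self.cap
        rate = flushed / renamed
        if rate > self.high_flush:
            self.cap = max(self.min_cap, self.cap // 2)  # back off quickly
        elif rate < self.low_flush:
            self.cap = min(self.max_cap, self.cap + 4)   # recover slowly
        return self.cap

    def may_rename(self, inflight_speculative: int) -> bool:
        """Gate at rename/decode: the thread stalls once it reaches its cap."""
        return inflight_speculative < self.cap
```

The multiplicative-decrease, additive-increase shape is a common choice for this kind of throttle because it reacts fast to bursts of misprediction while giving the resources back gradually; other policies would fit the abstract equally well.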
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1016/j.micpro.2024.105047
P. Santos, E. Mendes, J. Carvalho, F. Alves, J. Azevedo, J. Cabral
Active Noise Cancellation (ANC) systems are widely used to mitigate unwanted noise in several applications, such as automotive environments and high-end headsets. Multi-Channel (MC) ANC systems have shown promise in creating improved silent zones. These systems are typically implemented on FPGA platforms because of the systolic nature and granularity of optimization of these devices. This article describes the design, implementation and evaluation of a wavelet-based MC ANC Filtered-x Normalized Least Mean Square (FxNLMS) system on an FPGA platform.
The wavelet transform decomposes complex noise signals into spectrally more compact signals (i.e., signals that are easier to process), and an independent NLMS filter is applied to each decomposed signal. The system implements 64 parallel NLMS filters with 1000 coefficients. Additionally, the static FIR filters employed for the secondary and tertiary path estimations are of order 2047. The system adopts an integer arithmetic architecture and operates at a sampling rate of 47.97 kHz. To assess the performance of the wavelet-based approach, benchmark tests were conducted against a similar implementation without the wavelet transform. The evaluation used noise reduction (NR) tests with spectrally rich (20 Hz to 10 kHz) and high-dynamic-range noise. The experimental setup involved two error microphones and two secondary sources.
The results show that the wavelet-based version has better overall performance than the traditional implementation, particularly in the higher frequency band of the spectrum (1 kHz to 8 kHz). For instance, for city ambient noise (a realistic noise with high dynamic range), the relative NR achieved was 8.23 dB.
To the authors' knowledge, this is the first implementation and field test of a wavelet-based MC ANC on an FPGA platform. Moreover, the results show that the novel approach is better at reducing complex noise than the traditional implementation without wavelets.
Title: Hardware accelerated Active Noise Cancellation system using Haar wavelets | Journal: Microprocessors and Microsystems, vol. 107, Article 105047 | Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0141933124000425/pdfft?md5=694a4b8ef90eac68e2e659134a17a6f8&pid=1-s2.0-S0141933124000425-main.pdf
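A minimal sketch of the two building blocks the ANC abstract combines: one level of Haar analysis that splits a signal into approximation and detail subbands, and an NLMS update that adapts an FIR filter independently for each subband. It uses floating point rather than the integer arithmetic of the FPGA design, omits the filtered-x (secondary-path) stage of FxNLMS, and the function names, filter length and step size are illustrative assumptions.

```python
import math


def haar_split(x):
    """One level of the orthonormal Haar transform: (approximation, detail).
    Each output has len(x) // 2 samples; a trailing odd sample is dropped."""
    s = 1 / math.sqrt(2)
    approx = [s * (x[2 * i] + x[2 * i + 1]) for i in range(len(x) // 2)]
    detail = [s * (x[2 * i] - x[2 * i + 1]) for i in range(len(x) // 2)]
    return approx, detail


class NLMS:
    """Normalized LMS adaptive FIR filter (one instance per subband)."""

    def __init__(self, taps, mu=0.5, eps=1e-6):
        self.w = [0.0] * taps
        self.buf = [0.0] * taps  # most recent input sample first
        self.mu, self.eps = mu, eps

    def step(self, x, d):
        """Push input sample x, adapt towards desired sample d, return the a-priori error."""
        self.buf = [x] + self.buf[:-1]
        y = sum(wi * xi for wi, xi in zip(self.w, self.buf))
        e = d - y
        norm = self.eps + sum(xi * xi for xi in self.buf)  # input power normalization
        g = self.mu * e / norm
        self.w = [wi + g * xi for wi, xi in zip(self.w, self.buf)]
        return e
```

In a subband ANC, each Haar output stream would drive its own `NLMS` instance; the normalization by input power keeps the effective step size stable even though the subbands carry very different energy.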
Pub Date: 2024-06-01 | Epub Date: 2021-01-23 | DOI: 10.1016/j.micpro.2021.104040
Xiaoyan Lei, Huibin Wang, Jie Shen, Zhe Chen, Weidong Zhang
With the rapid development of artificial intelligence, image processing technology has become increasingly widely used. Image enhancement is an important part of image processing and has become a research hotspot in both the theory and the application of image processing technology. This article proposes a new underwater image enhancement method that addresses color distortion, low contrast and blurring in underwater images. A compensation factor, constructed from the mean differences between the damaged color channels and the well-preserved color channel, is used to compensate the badly damaged color channels. Then, multi-scale Retinex with color restoration (MSRCR) is used for denoising and color-distortion correction. Finally, CLAHS and global contrast stretching are used to improve the local and global contrast of the images. Qualitative and quantitative evaluations show that the proposed method removes the color cast and improves the contrast of underwater images; the processed images have natural color, high contrast and high clarity. The method also achieves good results on low-light underwater images and on underwater images captured by different cameras in different scenes.
Title: A novel intelligent underwater image enhancement method via color correction and contrast stretching | Journal: Microprocessors and Microsystems, vol. 107, Article 104040
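The abstract does not give the exact compensation formula, so this is a minimal sketch under a simple assumption: each damaged channel is shifted by a factor k·(mean of the well-preserved channel − mean of the damaged channel), then clipped, and a global min-max stretch is applied afterwards. The MSRCR and CLAHS stages are not shown; channels are flat lists of floats in [0, 1], and the function names and the strength parameter `k` are hypothetical.

```python
def compensate_channel(damaged, reference, k=1.0):
    """Shift a damaged channel towards the mean of a well-preserved reference channel.
    The compensation factor is built from the difference of the channel means."""
    mean_d = sum(damaged) / len(damaged)
    mean_r = sum(reference) / len(reference)
    factor = k * (mean_r - mean_d)
    return [min(1.0, max(0.0, v + factor)) for v in damaged]  # keep values in [0, 1]


def stretch(channel, lo=0.0, hi=1.0):
    """Global linear contrast stretch of one channel onto [lo, hi]."""
    cmin, cmax = min(channel), max(channel)
    if cmax == cmin:  # flat channel: nothing to stretch
        return list(channel)
    scale = (hi - lo) / (cmax - cmin)
    return [lo + (v - cmin) * scale for v in channel]
```

Underwater, red is usually the most attenuated channel and green the best preserved, so a typical call would be `compensate_channel(red, green)` followed by `stretch` on each channel.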
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1016/j.micpro.2024.105048
Sudhanshu Janwadkar, Rasika Dhavse
Impedance cardiography (ICG) is a rapidly growing non-invasive approach to cardiac health monitoring. Synchronous detection of ICG requires an FIR filter to remove the high-frequency carrier signal. Low power consumption and compact area are critical considerations in the design of portable biomedical systems. This paper proposes a novel product-quantization-based optimization strategy for the Urdhva Tiryagbhyam Sutra-based multiplier architecture, and presents an ASIC design of a low-power, low-area 64th-order programmable FIR filter architecture using the optimized Urdhva Tiryagbhyam Multiplier. The programmable architecture enables medical practitioners to select the carrier frequency at which the ICG analysis is performed. The elimination of redundant multipliers from the design, based on the filter coefficients, is demonstrated. The programmable Vedic FIR filter architecture (described in VHDL) is implemented on the Basys-3 FPGA board for rapid prototyping. The RTL-to-GDSII flow has been completed using Cadence digital design and sign-off tools for the SCL 180 nm technology.
The results indicate that the proposed filter architecture occupies 41.33% less area and consumes 42.16% less power than contemporary designs described in the literature.
Title: ASIC design of power and area efficient programmable FIR filter using optimized Urdhva-Tiryagbhyam Multiplier for impedance cardiography | Journal: Microprocessors and Microsystems, vol. 107, Article 105048
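Urdhva Tiryagbhyam ("vertically and crosswise") is the Vedic multiplication pattern in which column k of the product gathers every digit product a_i·b_j with i + j = k, after which the column sums are resolved with carries. A minimal bit-level software model of that pattern follows, together with a direct-form FIR filter that skips zero-coefficient taps, which is one simple reading of "elimination of redundant multipliers". The paper's product-quantization optimization is not modeled; the function names, the 8-bit default width and the sign-magnitude handling of coefficients are assumptions.

```python
def ut_multiply(a, b, width=8):
    """Urdhva Tiryagbhyam multiplication of two unsigned width-bit integers.
    Column k collects all bit products a_i & b_j with i + j = k; each column keeps
    one result bit and passes the rest of its sum on as carry."""
    abits = [(a >> i) & 1 for i in range(width)]
    bbits = [(b >> i) & 1 for i in range(width)]
    result, carry = 0, 0
    for k in range(2 * width - 1):
        lo, hi = max(0, k - width + 1), min(k, width - 1)
        col = carry + sum(abits[i] & bbits[k - i] for i in range(lo, hi + 1))
        result |= (col & 1) << k
        carry = col >> 1
    return result | (carry << (2 * width - 1))  # final carry forms the top bits


def fir(samples, coeffs, width=8):
    """Direct-form FIR y[n] = sum_k c_k * x[n-k] on unsigned width-bit samples.
    Zero-coefficient taps are dropped, so they need no multiplier at all;
    signed coefficients are applied in sign-magnitude form (|c_k| must fit in width bits)."""
    taps = [(k, c) for k, c in enumerate(coeffs) if c != 0]
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in taps:
            if n >= k:
                prod = ut_multiply(samples[n - k], abs(c), width)
                acc += prod if c > 0 else -prod
        out.append(acc)
    return out
```

In hardware every column of `ut_multiply` is computed concurrently, which is where the pattern's appeal for low-latency multipliers comes from; the sequential loop here only reproduces the arithmetic.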
In the theoretical context of side-channel attacks, optimal bounds relating success rate (SR), guessing entropy (GE) and statistical distance (SD) are derived with a simple majorization (Schur-concavity) argument. The bounds are further refined theoretically for different versions of the classical Hamming weight leakage model, in particular assuming a priori equiprobable secret keys and additive white Gaussian measurement noise. Closed-form expressions and numerical computations are given. A study of the impact of the choice of substitution box on side-channel resistance reveals that its nonlinearity tends to homogenize the expressivity of success rate, guessing entropy and statistical distance. The intriguing approximate relation GE = 1/SR is observed in the case of 8-bit bytes and low noise. The exact relation GE = (M + 1)/2 − (M/2)·SD between guessing entropy, statistical distance and alphabet size M is proved for deterministic leakages and equiprobable keys.
Title: Be My Guesses: The interplay between side-channel leakage metrics | Authors: Julien Béguinot, Wei Cheng, Sylvain Guilley, Olivier Rioul | Journal: Microprocessors and Microsystems, vol. 107, Article 105045 | Pub Date: 2024-06-01 | DOI: 10.1016/j.micpro.2024.105045
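The exact relation for deterministic leakages can be checked numerically. This is a minimal sketch, assuming SD denotes the statistical (total variation) distance between the joint distribution P(K, Y) and the product P(K)P(Y), the leakage is the noiseless Hamming weight Y = HW(K), and keys are uniform over M = 2^n values. Exact rational arithmetic avoids rounding; the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def leakage_metrics(leak, n_bits):
    """Exact (SR, GE, SD) for a deterministic leakage Y = leak(K), K uniform over M = 2**n_bits."""
    M = 2 ** n_bits
    counts = Counter(leak(k) for k in range(M))  # n_y = number of keys leaking y
    p_key = Fraction(1, M)
    # SR: after observing y, the MAP guess is correct with probability 1 / n_y.
    sr = sum(Fraction(n, M) * Fraction(1, n) for n in counts.values())
    # GE: keys sharing a leakage value are equiprobable, so the correct key
    # sits at rank (n_y + 1) / 2 on average in an optimal guessing order.
    ge = sum(p_key * Fraction(counts[leak(k)] + 1, 2) for k in range(M))
    # SD: total variation distance between P(K, Y) and P(K)P(Y), from the definition.
    sd = Fraction(0)
    for k in range(M):
        yk = leak(k)
        for y, n in counts.items():
            joint = p_key if y == yk else Fraction(0)
            sd += abs(joint - p_key * Fraction(n, M))
    return sr, ge, sd / 2


def hamming_weight(k):
    return bin(k).count("1")
```

With `leakage_metrics(hamming_weight, 8)`, GE equals (257/2) − 128·SD exactly, and the two extreme leakages behave as expected: a constant leakage gives SD = 0 and GE = (M + 1)/2, while the identity leakage gives GE = 1.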
This paper presents an FPGA implementation of the Number Theoretic Transform (NTT) for the Kyber post-quantum cryptography (PQC) standard. The NTT is the slowest operation within Kyber, so considerable effort has gone into improving its computational efficiency. Leveraging parallelism and dedicated multipliers, our design achieves state-of-the-art latency, performing the NTT/INTT in just 0.4/0.5 μs and surpassing existing designs by factors of at least 3.75/3. The proposed design is implemented on the cost-effective Artix-7 FPGA.
Title: Low latency FPGA implementation of NTT for Kyber | Authors: Mohamed Saoudi, Akram Kermiche, Omar Hocine Benhaddad, Nadir Guetmi, Boufeldja Allailou | Journal: Microprocessors and Microsystems, vol. 107, Article 105059 | Pub Date: 2024-06-01 | DOI: 10.1016/j.micpro.2024.105059
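For reference, a plain software model of the transform being accelerated: Kyber's 7-layer NTT and its inverse with q = 3329, n = 256 and ζ = 17, following the structure of the NTT and inverse-NTT pseudocode in NIST FIPS 203 (ML-KEM, the standardized form of Kyber). It uses ordinary modular reduction instead of the Montgomery or Barrett arithmetic a hardware design would use, and it says nothing about the authors' architecture.

```python
Q = 3329   # Kyber modulus
ZETA = 17  # primitive 256th root of unity mod Q


def bitrev7(i):
    """Reverse the 7 low bits of i (0 <= i < 128)."""
    return int(f"{i:07b}"[::-1], 2)


ZETAS = [pow(ZETA, bitrev7(i), Q) for i in range(128)]


def ntt(f):
    """Forward NTT with Cooley-Tukey butterflies over 7 layers (output in bit-reversed order)."""
    f = list(f)
    k = 1
    length = 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[k]
            k += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


def intt(f):
    """Inverse NTT with Gentleman-Sande butterflies, then scaling by 128^-1 mod Q."""
    f = list(f)
    k = 127
    length = 2
    while length <= 128:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[k]
            k -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q
                f[j + length] = zeta * (f[j + length] - t) % Q
        length *= 2
    return [x * 3303 % Q for x in f]  # 3303 = 128^-1 mod 3329
```

Each layer consists of 128 independent butterflies, each needing one modular multiplication; that independence is what parallel butterfly units with dedicated multipliers can exploit to cut latency.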
Pub Date: 2024-06-01 | Epub Date: 2024-04-09 | DOI: 10.1016/j.micpro.2024.105050
Juan Encinas, Alfonso Rodríguez, Andrés Otero, Eduardo de la Torre
Reconfigurable multi-accelerator systems used as computation-offloading platforms in edge-cloud continuum scenarios usually have to deal with highly dynamic workloads and operating conditions. To take full advantage of their parallel processing capabilities and increase execution performance for a given workload, these systems need to continuously adapt their configuration (i.e., the number and type of accelerators) at run time. When working at the edge, additional requirements such as energy efficiency must also be met. In this paper, Machine Learning (ML) techniques are applied to extract predictive models of the execution of different combinations of hardware accelerators on a reconfigurable multi-accelerator platform, aiming to satisfy these continuous optimization needs. A key benefit of the proposed approach is that its data-driven models can transparently estimate the impact of the complex interactions between hardware accelerators caused by run-time resource contention, both among the accelerators and with the rest of the system; traditional modeling approaches (e.g., analytical models) cannot include that information in an easy and scalable way. The proposed models are complemented by a complete infrastructure to generate, execute and monitor dynamic workloads in FPGA-based systems. This infrastructure has been used to (i) quantitatively analyze resource contention in reconfigurable multi-accelerator systems and (ii) produce the training and evaluation datasets for the ML-based models using annotated power consumption and execution performance traces. Experimental results obtained with a reconfigurable multi-accelerator platform based on the ARTICo³ framework running the MachSuite benchmarks show that the proposed modeling approach is highly effective, with an average relative prediction error below 5% for both power consumption and execution performance.
Results also show that the ML-based models achieve high accuracy when predicting the impact of resource contention and accelerator interaction on both metrics, with a mean relative prediction error below 0.6% and a standard deviation below 4.7% in the worst case.
Title: Data-driven modeling of reconfigurable multi-accelerator systems under dynamic workloads | Journal: Microprocessors and Microsystems, vol. 107, Article 105050 | Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0141933124000450/pdfft?md5=a52d32f5fafee4bda56df513540d6eb8&pid=1-s2.0-S0141933124000450-main.pdf
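The abstract does not name the ML algorithms used, so the following is only a minimal stand-in for the idea of a data-driven model. It fits an ordinary least-squares model to measurements from different accelerator configurations, using the accelerator count per type plus pairwise products of those counts as crude proxies for contention between accelerators, and it reports the mean relative prediction error, the metric quoted in the abstract. The features, function names and the two-accelerator example are hypothetical.

```python
def features(config):
    """Feature vector for a configuration given as a tuple of accelerator counts per type."""
    x = [1.0] + [float(c) for c in config]
    for i in range(len(config)):
        for j in range(i, len(config)):
            x.append(float(config[i] * config[j]))  # crude interaction / contention terms
    return x


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    m = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def fit(configs, targets):
    """Ordinary least squares through the normal equations (X^T X) w = X^T y."""
    X = [features(c) for c in configs]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * t for row, t in zip(X, targets)) for i in range(p)]
    return solve(xtx, xty)


def predict(w, config):
    return sum(wi * xi for wi, xi in zip(w, features(config)))


def mean_relative_error(w, configs, targets):
    """Average of |prediction - measurement| / |measurement| over a dataset."""
    return sum(abs(predict(w, c) - t) / abs(t) for c, t in zip(configs, targets)) / len(targets)
```

A linear model with hand-picked interaction terms only captures contention of the shape it is given; the appeal of the learned models described in the abstract is that they can pick up such effects from traces without an explicit analytical form.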
Pub Date: 2024-06-01 | Epub Date: 2024-04-15 | DOI: 10.1016/j.micpro.2024.105058
Farhad EbrahimiAzandaryani, Dietmar Fey
This paper presents an effective μ-architectural design method, called ExTern, to enhance the performance of a RISC-V processor experiencing computation bottlenecks. ExTern integrates Canonical Signed Digit (CSD) representation, a ternary number system enabling carry/borrow-free addition/subtraction in constant time, O(1), into the RISC-V processor, particularly into the execution stage. Furthermore, it adopts an extended six-stage pipeline architecture to maximize the benefits of the encoding, further improving overall execution time and throughput. Despite the presence of optimized circuits for binary encoding on the FPGA, such as fast carry-chain (CARRY4) modules, the customized processor applying ExTern, RISC-VT, shows a marked improvement in computation performance. Experimental results demonstrate a 34.3% (12.2%) improvement in operating frequency, leading to 31% (11.5%) lower execution time and a 32% (12%) increase in throughput compared to a state-of-the-art open-source five-stage (six-stage) RISC-V processor.
Title: ExTern: Boosting RISC-V core performance using ternary encoding | Journal: Microprocessors and Microsystems, vol. 107, Article 105058 | Open-access PDF: https://www.sciencedirect.com/science/article/pii/S014193312400053X/pdfft?md5=5219c364add625230da3e174054a963d&pid=1-s2.0-S014193312400053X-main.pdf
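Canonical Signed Digit representation is the non-adjacent form: every integer has a unique representation with digits in {−1, 0, +1} in which no two adjacent digits are both non-zero. A minimal conversion sketch follows. It only illustrates the encoding; the constant-time carry/borrow-free adder and its integration into the RISC-VT pipeline are not modeled, and the function names are hypothetical.

```python
def to_csd(n):
    """CSD (non-adjacent form) digits of integer n, least significant digit first.
    Each digit is -1, 0 or +1, and no two adjacent digits are both non-zero."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d           # leaves n divisible by 4, forcing the next digit to 0
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def from_csd(digits):
    """Value of a least-significant-first signed-digit vector."""
    return sum(d << i for i, d in enumerate(digits))
```

For example, 7 (binary 111, three non-zero bits) becomes 8 − 1, i.e. the digit vector [-1, 0, 0, 1] with only two non-zero digits.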
The localization techniques used in today's smartphones are mainly based on the Global Positioning System (GPS). However, GPS sensors do not work properly indoors or underground. Therefore, many applications use device sensors such as the accelerometer, gyroscope and magnetometer for indoor localization. In this paper, we present a misuse case showing how device sensors can be used to violate a user's privacy through geo-tracking. We propose an attack model through which the user's location can be compromised without using the GPS sensors. The proposed attack model comprises two stages. The first stage consists of deploying the malicious application on users' smartphones and collecting data from various sensors in the background; the collected sensor data is uploaded to a malicious cloud server set up by the adversary. The second stage consists of pre-processing the sensor data received from the malicious cloud server and plotting the user's trajectory on a graph in real time. The proposed attack model is evaluated by developing two applications. The victim application tracks the location, direction and trajectory of the user without any location permission from the user. The proposed model achieves an accuracy of 98% without special infrastructure or a separate training phase. Further, we discuss three mitigation schemes that Android developers can adopt to protect users' privacy.
Indoor localization using device sensors: A threat to privacy. Hitesh Verma, Smita Naval, Balaprakasa Rao Killi, Vinod P. Microprocessors and Microsystems, vol. 106, Article 105041, 2024-04-01. DOI: 10.1016/j.micpro.2024.105041
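The abstract above does not say how the second stage turns raw sensor readings into a trajectory. A common sensor-only approach is pedestrian dead reckoning (PDR): detect steps from peaks in the accelerometer magnitude, then advance the position by an assumed step length along the current heading. The Python sketch below illustrates that general technique under stated assumptions; it is not the authors' implementation. The function names (`detect_steps`, `reconstruct_trajectory`), the 1.5 m/s² threshold, the 10-sample minimum gap between steps, and the fixed 0.7 m step length are hypothetical, and per-sample headings are assumed to be already estimated (for example, by fusing gyroscope and magnetometer readings).

```python
import math
from typing import List, Sequence, Tuple

GRAVITY = 9.81  # m/s^2, subtracted as a crude gravity removal


def detect_steps(accel_mag: Sequence[float], threshold: float = 1.5,
                 min_gap: int = 10) -> List[int]:
    """Return sample indices at which a step is detected.

    A step is counted on each rising edge of the gravity-removed
    acceleration magnitude above `threshold` (m/s^2), provided at least
    `min_gap` samples have passed since the previous step.
    Threshold and gap values are hypothetical defaults.
    """
    steps: List[int] = []
    last = -min_gap
    above = False  # was the previous sample above the threshold?
    for i, a in enumerate(accel_mag):
        is_above = (a - GRAVITY) > threshold
        if is_above and not above and i - last >= min_gap:
            steps.append(i)
            last = i
        above = is_above
    return steps


def reconstruct_trajectory(accel_mag: Sequence[float],
                           heading_rad: Sequence[float],
                           step_length: float = 0.7) -> List[Tuple[float, float]]:
    """Dead-reckon a 2-D path: one fixed-length stride per detected step,
    taken in the heading assumed known at that sample (0 rad = +x axis)."""
    x, y = 0.0, 0.0
    path = [(x, y)]
    for i in detect_steps(accel_mag):
        x += step_length * math.cos(heading_rad[i])
        y += step_length * math.sin(heading_rad[i])
        path.append((x, y))
    return path


if __name__ == "__main__":
    # Synthetic trace: one acceleration peak every 10 samples, 8 steps total,
    # walking along +x for 4 steps, then turning to +y for 4 steps.
    accel = ([9.81] * 5 + [12.0] + [9.81] * 4) * 8
    heading = [0.0] * 40 + [math.pi / 2] * 40
    end_x, end_y = reconstruct_trajectory(accel, heading)[-1]
    print(f"{end_x:.2f}, {end_y:.2f}")  # 2.80, 2.80
```

With four steps east and four steps north, the sketch ends near (2.8, 2.8) m. A practical pipeline would likely filter the signal and adapt the step length per user; the fixed values here only keep the example short.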
Aggressively reducing the supply voltage ($V_{dd}$) below the safe threshold voltage ($V_{min}$) can effectively lead to significant energy savings in digital circuits. However, operating at such low supply voltages poses challenges due to a high occurrence of permanent faults resulting from manufacturing process variations in current technology nodes.
This work addresses the impact of permanent faults on the accuracy of a Convolutional Neural Network (CNN) inference accelerator whose on-chip activation memories are supplied at a low $V_{dd}$ below $V_{min}$. Based on a characterization study of fault patterns, this paper proposes Flip-and-Patch, a pair of low-cost microarchitectural techniques that maintain the original accuracy of CNN applications even in the presence of a high number of faults caused by operating at $V_{dd} < V_{min}$. Unlike existing techniques, Flip-and-Patch remains transparent to the programmer and does not rely on application characteristics, making it easily applicable to real CNN accelerators.
Experimental results show that Flip-and-Patch ensures the original CNN accuracy with minimal impact on system performance (less than 0.05% for every application), while achieving average energy savings of 10.5% and 46.6% in activation memories compared to a conventional accelerator operating at safe and nominal supply voltages, respectively. Compared to the state-of-the-art ThUnderVolt technique, which dynamically adjusts the supply voltage at run time, the average energy savings are higher by 3.2%, even when the energy overhead of ThUnderVolt's run-time adjustment is disregarded.
Flip-and-Patch: A fault-tolerant technique for on-chip memories of CNN accelerators at low supply voltage. Yamilka Toca-Díaz, Reynier Hernández Palacios, Rubén Gran Tejero, Alejandro Valero. Microprocessors and Microsystems, vol. 106, Article 105023, 2024-04-01. DOI: 10.1016/j.micpro.2024.105023. Open access PDF: https://www.sciencedirect.com/science/article/pii/S0141933124000188/pdfft?md5=9ccb11570430c8c998f414582a020757&pid=1-s2.0-S0141933124000188-main.pdf
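The abstract does not explain how Flip and Patch work internally, so the Python sketch below does not reproduce them. It only illustrates, under stated assumptions, the permanent (stuck-at) fault model the abstract refers to and one generic, well-known way to tolerate it: storing each word either as-is or bitwise-inverted, with one extra flag bit recording the choice. Because each stuck-at bit agrees with exactly one of the two encodings, picking the encoding with the smaller numeric error always masks the most significant faulty bit, which matters for quantized activations, where errors in high-order bits dominate. The 8-bit word width, the names `apply_stuck_at`, `store`, and `load`, and the assumption that the flag bit itself is fault-free are all hypothetical.

```python
from typing import Tuple

WORD_BITS = 8                      # hypothetical: 8-bit quantized activations
FULL = (1 << WORD_BITS) - 1        # all-ones mask for one word


def apply_stuck_at(value: int, sa0: int, sa1: int) -> int:
    """Model a memory word with permanent faults.

    Bits set in `sa1` always read as 1 and bits set in `sa0` always read
    as 0, whatever was written (the two masks are assumed disjoint).
    """
    return (value | sa1) & ~sa0 & FULL


def store(value: int, sa0: int, sa1: int) -> Tuple[int, bool]:
    """Write `value` as-is or inverted, whichever corrupts it less.

    Returns the physical contents of the faulty word and the flag bit
    (hypothetically kept in a fault-free cell) that records the inversion.
    """
    plain = apply_stuck_at(value, sa0, sa1)
    inverted = apply_stuck_at(value ^ FULL, sa0, sa1)
    plain_err = plain ^ value              # bits the reader would get wrong
    inv_err = (inverted ^ FULL) ^ value    # same, after undoing the inversion
    # Comparing the error masks as integers favors errors in low-order bits,
    # so the most significant stuck-at bit is always masked.
    if inv_err < plain_err:
        return inverted, True
    return plain, False


def load(cell: int, flag: bool) -> int:
    """Decode a word written by `store`."""
    return cell ^ FULL if flag else cell


if __name__ == "__main__":
    # Bit 7 stuck at 1, bit 0 stuck at 0; the value written is 0.
    print(load(*store(0, sa0=0b0000_0001, sa1=0b1000_0000)))  # 1 (a plain write reads back 128)
```

In the example, a plain write of 0 reads back as 128 because of the stuck-at-1 fault in the most significant bit, whereas the inverted encoding leaves only a least-significant-bit error.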