Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893253
M. Corrêa, D. Palomino, G. Corrêa, L. Agostini
This work presents a fast decision algorithm and its hardware design for AV1 intra prediction, inspired by the direction detection algorithm used in the CDEF (Constrained Directional Enhancement Filter) of the same codec. The main objective is to reduce the number of intra candidates with a low-cost heuristic, allowing a faster prediction time in software as well as a low-area, low-power intra prediction hardware design. The proposed algorithm was implemented in the AV1 reference encoder (libaom), and experiments showed an average encoding time reduction of 22.56% at the cost of a 1.26% BD-BR increase. Synthesis of the hardware design, targeting TSMC 40 nm technology at a frequency of 951 MHz, resulted in an area of 39K NAND2 gates and a power of 4.92 mW. This target frequency is sufficient to process UHD 4K (3,840×2,160 pixels) video at 30 frames per second. When this hardware is integrated with a directional AV1 intra prediction hardware, a dynamic power dissipation reduction of up to 93% is expected.
Title: Direction-Based Fast Mode Decision and Hardware Design for the AV1 Intra Prediction. Published in: 2022 35th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI).
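The throughput claim in the AV1 abstract above can be turned into a rough cycle budget. The sketch below only uses the figures quoted there (951 MHz, 3,840×2,160 pixels, 30 frames per second); the 8×8 block size passed to the helper is an illustrative assumption, and nothing here describes how the actual design spends its cycles.

```python
# Cycle budget implied by the figures quoted in the abstract.
# The internal scheduling of the real hardware is not stated there.
WIDTH, HEIGHT, FPS = 3840, 2160, 30
CLOCK_HZ = 951e6

pixels_per_second = WIDTH * HEIGHT * FPS
cycles_per_pixel = CLOCK_HZ / pixels_per_second


def cycles_per_block(block_size):
    """Clock cycles available for one square block of block_size x block_size pixels."""
    return cycles_per_pixel * block_size * block_size


print(pixels_per_second)               # pixels to process each second
print(round(cycles_per_pixel, 2))      # average cycles available per pixel
print(round(cycles_per_block(8), 1))   # budget for a hypothetical 8x8 block
```

Under these assumptions the clock leaves a little under four cycles per pixel on average, which gives a sense of how much parallelism a real-time design would need.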
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893251
T. Martin, C. Barnes, G. Grewal, S. Areibi
This paper proposes a set of Machine-Learning (ML) probes that can be used at the placement step within the Verilog-to-Routing (VTR) tool. The proposed probes can provide real-time feedback to the VTR placer, guiding it towards more “router-friendly” placement solutions that result in the router performing fewer computationally expensive rip-up and re-route operations. In addition to enabling the previous strategies for reducing routing runtimes, the proposed probes can also be used to speed up architecture exploration by providing estimates of interconnect resource utilization on the Field Programmable Gate Array (FPGA) without incurring the computational cost of actually performing routing. The results obtained indicate that the proposed ML probes not only improve upon all the VTR estimates in terms of wirelength, critical path delay and segmented wire utilization, but also reduce the routing time of the tool.
Title: Integrating Machine-Learning Probes into the VTR FPGA Design Flow
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893234
Gabriel H. Eisenkraemer, L. Oliveira, E. Carara
Artificial Neural Networks (ANNs) have become the most popular machine learning technique for data processing, performing central functions in a wide variety of applications. In many cases, these models are used within constrained scenarios, in which a local execution of the algorithm is necessary to avoid the latency and safety issues of remote computing (e.g., autonomous vehicles, edge devices in IoT networks). Even so, the known computational complexity of these models is still a challenge in such contexts, as implementation costs and performance requirements are difficult to balance. In these scenarios, parameter quantization techniques are essential to simplify the operations and reduce the memory footprint, making the hardware implementation more viable. In this paper, a case study is devised in which a convolutional neural network (CNN) architecture is fully implemented in hardware with three different optimization strategies, with parameters mapped to low bit-width fixed-point integers using a power-of-two quantization scheme. Both ASIC and FPGA implementation flows are followed, allowing for an in-depth analysis of each circuit version. The obtained results show that the adopted quantization process enables optimizations of the implemented circuit, reducing the circuit area by about 50% and the memory requirement by 87.5%. At the same time, the application performance was kept at the same level.
Title: Comparative Analysis of Hardware Implementations of a Convolutional Neural Network
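To illustrate why a power-of-two quantization scheme simplifies hardware, as the CNN abstract above describes, here is a minimal sketch in which each weight is approximated by a signed power of two so that a multiplication becomes a bit shift. The function names, the exponent range and the rounding rule are assumptions made for illustration; the abstract does not specify the exact scheme or bit widths used by the authors.

```python
import math


def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Return (sign, exp) so that sign * 2**exp approximates w.

    Rounding is done in the log2 domain; the exponent range is a
    hypothetical choice, not taken from the paper.
    """
    if w == 0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return sign, exp


def shift_multiply(x, sign, exp):
    """Multiply integer activation x by sign * 2**exp using shifts only."""
    if sign == 0:
        return 0
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted


sign, exp = quantize_pow2(0.3)       # 0.3 is approximated by 2**-2
print(sign, exp, shift_multiply(64, sign, exp))

# Memory saving if 32-bit parameters were stored in 4 bits (assumed widths).
print(1 - 4 / 32)
```

The last line shows that an 87.5% memory reduction would be consistent with, for example, moving from 32-bit to 4-bit parameters, although the abstract does not state the actual bit widths.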
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893227
O. Aiello, P. Crovetti, M. Alioto
In this paper, the tradeoff between conversion time and power in nW-power capacitance-to-digital converters (CDCs) is explored. The CDC in this work leverages the delay-power flexibility of dual-mode logic, is based on swappable oscillators, and operates at nW power and at low voltage down to 0.3 V without requiring any additional circuitry, reference or voltage regulation. Its self-calibration compensates for PVT variations and mismatch at any point of the chip lifecycle, eliminating the need for trimming at testing time. A test-chip demonstration of the CDC in 180 nm shows that its power consumption can be dynamically adjusted from 1.37 nW down to 418 pW at a conversion time down to hundreds of ms. This makes the CDC suitable for harvested systems with a very tight power budget and fluctuating voltage.
Title: Conversion Time-Power Tradeoff in Capacitance-to-Digital Converters with Dual-Mode Logic
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893220
Bruno Canal, H. Klimach, S. Bampi, T. Balen
This work presents an original SAR ADC architecture for low-power ADC applications. The proposed architecture uses a Time-to-Digital converter (TDC) to apply a window switching scheme in the SAR algorithm that predicts the switching value of the three MSB CDAC capacitors in just one SAR cycle. The switching scheme also implements correlated-reversed switching (CRS), improving the converter linearity. The proposed architecture is demonstrated on a 10-bit SAR ADC implementation, which takes ten SAR cycles to provide a 12-bit word to a digital error correction (DEC) block that translates it into a final 10-bit digital output. Considering a Gaussian random distribution to model the variability of unit capacitances, MATLAB simulations demonstrate an ADC linearity with DNL and INL values that are 52% and 69%, respectively, of those of a conventional VCM-based switching method. The switching scheme reduces the average switching energy by 50% compared with the conventional VCM-based switching method, considering a design with the redundancy searching range of the implemented CDAC. The proposed SAR ADC architecture is designed and simulated in a 28 nm CMOS technology. Working with a 600 mV power supply and a 10 MHz sampling frequency, the proposed architecture demonstrates a 28% improvement in ADC power dissipation compared with a traditional 10-bit SAR ADC implementation designed to have the same linearity.
Title: Time Assisted SAR ADC with Bit-guess and Digital Error Correction
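As background for the SAR ADC abstract above, the following is a minimal behavioural sketch of the conventional successive-approximation search, which resolves one bit per cycle. It is the textbook baseline only: it does not model the TDC-based window switching, the redundancy, or the digital error correction that the paper proposes, and the function name and ideal-DAC model are assumptions for illustration.

```python
def sar_convert(vin, vref=1.0, bits=10):
    """Ideal binary-search SAR conversion: one comparison per bit, MSB first."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        # Ideal DAC output for the trial code (no capacitor mismatch modelled).
        if vin >= trial * vref / (1 << bits):
            code = trial  # keep the bit when the input is at or above the trial level
    return code


print(sar_convert(0.5))    # mid-scale input
print(sar_convert(0.999))  # input close to full scale
```

In this baseline a 10-bit result always costs ten comparison cycles, which is the reference point against which schemes that resolve several MSBs in a single cycle are compared.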
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893237
Susann Rothe, J. Lienig
The reliability of integrated circuits is increasingly endangered by migration-induced degradation of metal interconnects. The risk of failure due to migration is not only rising in every new technology node, it is also constraining the miniaturization of interconnect structures. In addition to DC lines, such as power delivery networks, signal and clock lines are increasingly being degraded by migration. This paper summarizes our current knowledge in avoiding migration-induced integrated-circuit failures. After introducing and discussing migration mechanisms, we focus on the growing electromigration susceptibility and the increasing influence of thermal migration. Looking forward, we review novel IC design strategies that incorporate migration constraints and mitigation measures into layout synthesis.
Title: Reliability by Design: Avoiding Migration-Induced Failure in IC Interconnects
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893223
Tiago Knorst, Guilherme Korol, M. Jordan, J. Vicenzi, A. Lorenzon, M. B. Rutzig, A. C. S. Beck
Cloud environments have been steadily adopting collaborative CPU-FPGA architectures to accelerate applications by partitioning the execution of their kernels across both devices. However, exploiting the optimization techniques that both architectures offer is challenging, so they must be smartly employed depending on the application at hand and the target optimization (e.g., performance or energy). Given that, this work investigates the impact of collaboratively applying thread throttling (i.e., artificially decreasing the number of active threads) on the CPU side and HLS (High-Level Synthesis)-versioning on the FPGA side. We use a multi-tenant Cloud service as our object of study, where sequences of application requests with different priorities result in DAGs of application kernels that must be executed over the heterogeneous architecture. We show that synergistically applying thread throttling and HLS-versioning to the incoming kernels may improve the energy-delay product by up to 41x over the default, non-optimized execution.
Title: On the benefits of Collaborative Thread Throttling and HLS-Versioning in CPU-FPGA Environments
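The energy-delay product (EDP) used as the figure of merit in the CPU-FPGA abstract above is simply energy multiplied by execution time, so a lower value is better. The sketch below shows how an improvement factor would be computed; the energy and time values are made-up placeholders, not measurements from the paper.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP = energy x delay; lower is better."""
    return energy_joules * delay_seconds


def edp_improvement(baseline, optimized):
    """Ratio of baseline EDP to optimized EDP; each argument is (energy, delay)."""
    return energy_delay_product(*baseline) / energy_delay_product(*optimized)


# Hypothetical numbers for illustration only.
baseline = (100.0, 10.0)   # 100 J over 10 s
optimized = (40.0, 5.0)    # 40 J over 5 s
print(edp_improvement(baseline, optimized))
```

Because the two factors multiply, a configuration that saves both energy and time improves EDP by the product of the two individual gains.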