
2022 35th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI): Latest Papers

Direction-Based Fast Mode Decision and Hardware Design for the AV1 Intra Prediction
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893253
M. Corrêa, D. Palomino, G. Corrêa, L. Agostini
This work presents a fast mode decision algorithm and its hardware design for AV1 intra prediction, inspired by the direction detection algorithm used in the CDEF (Constrained Directional Enhancement Filter) of the same codec. The main objective is to reduce the number of intra candidates with a low-cost heuristic, enabling faster prediction in software as well as a low-area, low-power intra prediction hardware design. The proposed algorithm was implemented in the AV1 reference encoder (libaom), and experiments showed an average encoding time reduction of 22.56% at the cost of a 1.26% BD-BR increase. Synthesis of the hardware design, targeting TSMC 40 nm technology at a frequency of 951 MHz, resulted in an area of 39K NAND2 gates and a power of 4.92 mW. This target frequency is sufficient to process UHD 4K (3,840x2,160 pixels) video at 30 frames per second. When this hardware is integrated with a directional AV1 intra prediction hardware, a dynamic power dissipation reduction of up to 93% is expected.
Citations: 1
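The 4K throughput claim in the abstract above can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes one luma sample per pixel and ignores chroma and pipeline stalls, and the function name `cycles_per_pixel` is made up for illustration.

```python
def cycles_per_pixel(clock_hz, width, height, fps):
    """Clock cycles available per pixel at a given resolution and frame rate."""
    return clock_hz / (width * height * fps)

# Figures taken from the abstract: 951 MHz, UHD 4K, 30 frames per second.
pixels_per_second = 3840 * 2160 * 30          # 248,832,000 pixels/s
budget = cycles_per_pixel(951e6, 3840, 2160, 30)
```

Under these assumptions the target frequency leaves a budget of roughly 3.8 clock cycles per luma pixel.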
Integrating Machine-Learning Probes into the VTR FPGA Design Flow
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893251
T. Martin, C. Barnes, G. Grewal, S. Areibi
This paper proposes a set of Machine-Learning (ML) probes that can be used at the placement step within the Verilog-to-Routing (VTR) tool. The proposed probes can provide real-time feedback to the VTR placer, guiding it towards more “router-friendly” placement solutions that result in the router performing fewer computationally expensive rip-up and re-route operations. In addition to enabling this strategy for reducing routing runtime, the proposed probes can also be used to speed up architecture exploration by providing estimates of interconnect resource utilization on the Field Programmable Gate Array (FPGA) without incurring the computational cost of actually performing routing. The results indicate that the proposed ML probes not only improve upon all the VTR estimates in terms of wirelength, critical path delay and segmented wire utilization but also reduce the routing time of the tool.
Citations: 1
Comparative Analysis of Hardware Implementations of a Convolutional Neural Network
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893234
Gabriel H. Eisenkraemer, L. Oliveira, E. Carara
Artificial Neural Networks (ANNs) have become the most popular machine learning technique for data processing, performing central functions in a wide variety of applications. In many cases, these models are used in constrained scenarios in which local execution of the algorithm is necessary to avoid the latency and safety issues of remote computing (e.g., autonomous vehicles, edge devices in IoT networks). Even so, the known computational complexity of these models remains a challenge in such contexts, as implementation costs and performance requirements are difficult to balance. In these scenarios, parameter quantization techniques are essential for simplifying the operations and reducing the memory footprint, making the hardware implementation more viable. In this paper, a case study is devised in which a convolutional neural network (CNN) architecture is fully implemented in hardware with three different optimization strategies, with parameters mapped to low bit-width fixed-point integers using a power-of-two quantization scheme. Both ASIC and FPGA implementation flows are followed, allowing for an in-depth analysis of each circuit version. The results show that the adopted quantization process enables optimizations of the implemented circuit, reducing the circuit area by about 50% and the memory requirement by 87.5%, while application performance is kept at the same level.
Citations: 0
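The power-of-two quantization mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the exponent range, the rounding rule (nearest exponent in the log domain) and the names `quantize_pow2` and `shift_multiply` are assumptions chosen for illustration. The idea is that a weight of the form sign * 2**e turns each multiplication into a bit shift.

```python
import math

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Map a real weight to (sign, e), meaning sign * 2**e.

    The exponent range is a hypothetical choice; sign 0 encodes a zero weight.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    e = round(math.log2(abs(w)))           # nearest exponent in the log domain
    e = max(min_exp, min(max_exp, e))      # clamp to the representable range
    return sign, e

def shift_multiply(x, sign, e):
    """Multiply an integer activation x by sign * 2**e using shifts only."""
    if sign == 0:
        return 0
    shifted = x << e if e >= 0 else x >> -e
    return sign * shifted

# Example: 0.3 is approximated by 2**-2 = 0.25, so 64 * 0.3 becomes 64 >> 2.
sign, e = quantize_pow2(0.3)
product = shift_multiply(64, sign, e)

# Illustrative only: shrinking 32-bit parameters to 4 bits would save 87.5%
# of parameter memory. The abstract does not state the actual bit widths.
memory_saving = 1 - 4 / 32
```

Replacing multipliers with shifters is one plausible source of the area savings the abstract reports, although the abstract itself does not break the savings down.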
Conversion Time-Power Tradeoff in Capacitance-to-Digital Converters with Dual-Mode Logic
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893227
O. Aiello, P. Crovetti, M. Alioto
In this paper, the tradeoff between conversion time and power in nW-power capacitance-to-digital converters (CDCs) is explored. The CDC in this work leverages the delay-power flexibility of dual-mode logic, is based on swappable oscillators, and operates at nW power and at supply voltages down to 0.3 V without requiring any additional circuitry, reference or voltage regulation. Its self-calibration compensates for PVT variations and mismatch at any point of the chip lifecycle, eliminating the need for trimming at testing time. A test-chip demonstration of the CDC in 180 nm shows that its power consumption can be dynamically adjusted from 1.37 nW down to 418 pW at a conversion time down to hundreds of ms. This makes the CDC suitable for energy-harvested systems with a very tight power budget and fluctuating voltage.
Citations: 1
Time Assisted SAR ADC with Bit-guess and Digital Error Correction
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893220
Bruno Canal, H. Klimach, S. Bampi, T. Balen
This work presents an original SAR ADC architecture for low-power applications. The proposed architecture uses a Time-to-Digital Converter (TDC) to apply a window switching scheme in the SAR algorithm that predicts the switching value of the three MSB CDAC capacitors in just one SAR cycle. The switching scheme also implements correlated-reversed switching (CRS), improving the converter linearity. The proposed architecture is demonstrated on a 10-bit SAR ADC implementation, which takes ten SAR cycles to provide a 12-bit word to a digital error correction (DEC) block that translates it into the final 10-bit digital output. Using a Gaussian random distribution to model the variability of the unit capacitances, MATLAB simulations demonstrate an ADC linearity with DNL and INL values that are 52% and 69%, respectively, of those of a conventional VCM-based switching method. The switching scheme reduces the average switching energy by 50% compared with the conventional VCM-based switching method, considering a design with the redundancy searching range of the implemented CDAC. The proposed SAR ADC architecture is designed and simulated in a 28 nm CMOS technology. Working with a 600 mV power supply at a 10 MHz sampling frequency, the proposed architecture demonstrates a 28% improvement in ADC power dissipation compared with a traditionally implemented 10-bit SAR ADC designed to have the same linearity.
Citations: 0
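For context, the conventional SAR search that the abstract above improves on can be sketched as a one-bit-per-cycle binary search. This is a textbook baseline with an ideal comparator and DAC, not the proposed TDC-assisted scheme; the function name `sar_convert` and the unipolar 0..vref input range are assumptions for illustration.

```python
def sar_convert(vin, vref=1.0, bits=10):
    """Ideal successive-approximation conversion, one bit per SAR cycle."""
    code = 0
    for i in reversed(range(bits)):        # MSB first
        trial = code | (1 << i)            # tentatively set the current bit
        if vin >= trial * vref / (1 << bits):
            code = trial                   # keep the bit if the DAC level <= vin
    return code

mid_scale = sar_convert(0.5)               # 512

# One possible reading of the abstract (an assumption, not stated there):
# the first cycle resolves 3 bits and the remaining 9 cycles one bit each.
raw_bits = 3 + (10 - 1)                    # 12 raw bits in ten cycles
```

Under that reading, the two extra raw bits would be the redundancy that the DEC block folds back into the 10-bit output.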
Reliability by Design: Avoiding Migration-Induced Failure in IC Interconnects
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893237
Susann Rothe, J. Lienig
The reliability of integrated circuits is increasingly endangered by migration-induced degradation of metal interconnects. The risk of failure due to migration is not only rising with every new technology node but is also constraining the miniaturization of interconnect structures. In addition to DC lines, such as power delivery networks, signal and clock lines are increasingly being degraded by migration. This paper summarizes our current knowledge in avoiding migration-induced integrated-circuit failures. After introducing and discussing migration mechanisms, we focus on the growing electromigration susceptibility and the increasing influence of thermal migration. Looking forward, we review novel IC design strategies that incorporate migration constraints and mitigation measures into layout synthesis.
Citations: 2
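The growing electromigration susceptibility described in the abstract above is commonly reasoned about with Black's equation, MTTF = A * J^(-n) * exp(Ea / (k * T)). The abstract does not mention this model, so the sketch below is background only; the exponent `n=2.0`, the activation energy `ea_ev=0.9` and the function name `black_mttf` are illustrative assumptions, not values from the paper.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def black_mttf(j, temp_k, a=1.0, n=2.0, ea_ev=0.9):
    """Median time to failure by Black's equation (arbitrary units).

    j is the current density in arbitrary units, temp_k the temperature in K.
    a, n and ea_ev are hypothetical fitting parameters.
    """
    return a * j ** (-n) * math.exp(ea_ev / (K_B_EV * temp_k))

# Doubling the current density at a fixed temperature with n = 2
# cuts the predicted lifetime to one quarter.
ratio = black_mttf(2.0, 378.15) / black_mttf(1.0, 378.15)
```

This is why shrinking wire cross-sections, which raises current density, makes migration constraints more pressing at each node.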
On the benefits of Collaborative Thread Throttling and HLS-Versioning in CPU-FPGA Environments
Pub Date : 2022-08-22 DOI: 10.1109/SBCCI55532.2022.9893223
Tiago Knorst, Guilherme Korol, M. Jordan, J. Vicenzi, A. Lorenzon, M. B. Rutzig, A. C. S. Beck
Cloud environments have been steadily adopting collaborative CPU-FPGA architectures to accelerate applications by partitioning the execution of their kernels across both devices. However, exploiting the optimization techniques that both architectures offer is challenging, so they must be smartly employed depending on the application at hand and the optimization target (e.g., performance or energy). Given that, this work investigates the impact of collaboratively applying thread throttling (i.e., artificially decreasing the number of active threads) on the CPU side and HLS (High-Level Synthesis)-versioning on the FPGA side. We use a multi-tenant cloud service as our object of study, where sequences of application requests with different priorities result in DAGs of application kernels that must be executed over the heterogeneous architecture. We show that synergistically applying thread throttling and HLS-versioning to the incoming kernels may improve the Energy-Delay product by up to 41x over the default, non-optimized execution.
Citations: 0
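The Energy-Delay product (EDP) used as the metric in the abstract above is simply energy multiplied by execution time, and thread throttling amounts to choosing the thread count that minimizes it. The sketch below shows that selection; the measurements are made-up numbers for illustration only, and the names `edp` and `best_config` are hypothetical.

```python
def edp(energy_j, delay_s):
    """Energy-Delay product: lower is better."""
    return energy_j * delay_s

def best_config(measurements):
    """Pick the configuration with the lowest EDP."""
    return min(measurements, key=lambda m: edp(m["energy_j"], m["delay_s"]))

# Hypothetical profile of one kernel at three thread counts (not real data).
measurements = [
    {"threads": 16, "energy_j": 120.0, "delay_s": 10.0},  # EDP 1200
    {"threads": 8,  "energy_j": 80.0,  "delay_s": 12.0},  # EDP 960
    {"threads": 4,  "energy_j": 70.0,  "delay_s": 20.0},  # EDP 1400
]
best = best_config(measurements)
```

In this made-up profile, running fewer threads than the maximum is slightly slower but saves enough energy to give the best EDP, which is the kind of tradeoff thread throttling exploits.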