2019 International SoC Design Conference (ISOCC): Latest Publications

Energy-efficient DNN-training with Stretchable DRAM Refresh Controller and Critical-bit Protection
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078532
Duy-Thanh Nguyen, I. Chang
Training a DNN is a time-consuming process that requires intensive memory resources. Many software-based approaches have been proposed to improve the performance and energy efficiency of DNN inference, while training hardware has received only limited attention. In this work, we present a novel DRAM architecture with critical-bit protection. Our method targets the main memory and graphics memory of the training system. In experiments on GEM5-GPGPUsim, our proposed DRAM architecture achieves 23% and 12% DRAM energy reduction with 32-bit floating-point data on main and graphics memories, respectively. It also improves system performance by 0.43% to 4.12% while maintaining a negligible accuracy drop when training DNNs.
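The notion of a "critical bit" in FP32 training data can be illustrated with a small experiment. The sketch below (plain IEEE-754 bit manipulation, not the authors' controller) flips individual bits of a hypothetical weight, showing why exponent bits warrant strong refresh protection while mantissa LSBs can tolerate a relaxed, stretched refresh.

```python
import struct

def float_to_bits(x: float) -> int:
    # 32-bit IEEE-754 pattern of x, interpreted as float32
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(b: int) -> float:
    return struct.unpack(">f", struct.pack(">I", b))[0]

def flip_bit(x: float, pos: int) -> float:
    # Flip bit `pos` of a float32 value (0 = mantissa LSB, 31 = sign)
    return bits_to_float(float_to_bits(x) ^ (1 << pos))

w = 0.7421  # a hypothetical DNN weight
print(flip_bit(w, 0))   # mantissa LSB flip: ~6e-8 perturbation, tolerable
print(flip_bit(w, 30))  # exponent MSB flip: jumps to ~1e38, a critical bit
```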
Citations: 0
Function-Level Module Sharing in High-Level Synthesis
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078522
Ryohei Nozaki, Hiroki Nishikawa, Ittetsu Taniguchi, H. Tomiyama
High-Level Synthesis (HLS), which automatically synthesizes an RTL circuit from a behavioral description written in a high-level programming language such as C, has become popular as a way to improve design productivity. However, the area of HLS-generated circuits is often larger than that of human-designed ones. One of the reasons is that HLS tools often generate multiple instances of the same module from a single C function. In this work, we propose a function-level module sharing technique for HLS.
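Module sharing trades latency for area: two calls to the same C function can be served by one time-multiplexed hardware instance instead of two parallel ones. The toy list-scheduling sketch below illustrates the idea only; the call names, latencies, and scheduling policy are illustrative assumptions, not the paper's algorithm.

```python
def schedule_shared(calls, num_instances):
    """Assign each invocation (name, ready_cycle, latency) a hardware
    instance and a start cycle, reusing instances once they free up."""
    free_at = [0] * num_instances   # cycle when each instance is idle again
    schedule = {}
    for name, ready, latency in sorted(calls, key=lambda c: c[1]):
        inst = min(range(num_instances), key=lambda i: max(free_at[i], ready))
        start = max(free_at[inst], ready)
        free_at[inst] = start + latency
        schedule[name] = (inst, start)
    return schedule

# Two calls to one C function, each taking 4 cycles:
calls = [("f_call_1", 0, 4), ("f_call_2", 1, 4)]
print(schedule_shared(calls, 1))  # one shared module: second call waits until cycle 4
print(schedule_shared(calls, 2))  # two instances: second call starts at cycle 1, ~2x area
```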
Citations: 1
Throughput Improvement of an Autocorrelation Block for Time Synchronization in OFDM-based LiFi
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078499
Erwin Setiawan, T. Adiono
In this paper, a throughput improvement of an autocorrelation block is presented. The improvement is achieved by employing a pipelined architecture. The autocorrelation block is used for time synchronization in an OFDM-based Visible Light Communication (VLC) system; it estimates the coarse time offset of the received OFDM data symbol. By adding pipeline registers, the critical path of the combinational circuit is divided into shorter paths, so the clock frequency can be increased. We show a threefold throughput improvement using a four-stage pipeline architecture. The maximum clock frequency of the block is 188 MHz. The block has been implemented and verified on a Xilinx Zynq-7000 FPGA.
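Functionally, such a coarse-timing block correlates the received stream with a delayed copy of itself, so a preamble built from two identical halves produces a metric peak at the symbol start (a Schmidl-Cox-style metric). The NumPy sketch below models that behavior under assumed signal parameters; it is a behavioral reference, not the pipelined RTL the paper describes.

```python
import numpy as np

def coarse_time_offset(r: np.ndarray, L: int) -> int:
    """Locate a preamble with two identical halves of length L by
    sliding autocorrelation, returning the estimated start index."""
    n = len(r) - 2 * L
    metric = np.empty(n)
    for d in range(n):
        p = np.sum(np.conj(r[d:d + L]) * r[d + L:d + 2 * L])  # half-vs-half correlation
        e = np.sum(np.abs(r[d + L:d + 2 * L]) ** 2)           # energy normalization
        metric[d] = np.abs(p) ** 2 / (e ** 2 + 1e-12)
    return int(np.argmax(metric))

rng = np.random.default_rng(0)
half = rng.standard_normal(32) + 1j * rng.standard_normal(32)
noise = lambda k: 0.1 * (rng.standard_normal(k) + 1j * rng.standard_normal(k))
r = np.concatenate([noise(40), half, half, noise(60)])
print(coarse_time_offset(r, L=32))   # close to the true offset of 40
```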
Citations: 2
Maximum Error-Aware Design of Approximate Array Multipliers
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078488
Kenta Shirane, Takahiro Yamamoto, Ittetsu Taniguchi, Yuko Hara-Azumi, S. Yamashita, H. Tomiyama
Approximate computing is a design approach for area-, power-, or performance-efficient circuits in which an exact arithmetic circuit is replaced with an approximate one. In this paper, we propose a methodology to systematically design a series of approximate array multipliers with different accuracy, area, power, and delay, from which a circuit designer can select the one that satisfies the accuracy requirement. Experimental results show the effectiveness of our approximate multipliers against existing approximate multipliers.
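One common way to build approximate array multipliers is to drop the least-significant partial-product columns; because the operand space is finite, the maximum error of each design point can be found exhaustively for small bit widths. The sketch below shows this style of design-space probe; the truncation scheme is a generic example, not necessarily the exact multiplier family the paper proposes.

```python
def truncated_multiply(a: int, b: int, n: int, k: int) -> int:
    """n-bit array multiplier that zeroes the k least-significant
    partial-product columns (a generic approximation style)."""
    acc = 0
    for i in range(n):
        if (b >> i) & 1:                       # one partial product per bit of b
            acc += (a << i) & ~((1 << k) - 1)  # drop columns below column k
    return acc

def max_error(n: int, k: int) -> int:
    """Exhaustive maximum error over all operand pairs (feasible for small n)."""
    return max(abs(a * b - truncated_multiply(a, b, n, k))
               for a in range(1 << n) for b in range(1 << n))

for k in range(4):
    print(k, max_error(n=4, k=k))  # accuracy/cost trade-off points to choose from
```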
Citations: 5
A Design of Low Inrush Current Low Dropout Regulator Using the Method of Pre-charging the Load
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078498
Yong-Deok Ahn, Sungjae Oh, Sungjin Kim, Kangyoon Lee
This paper proposes a low-inrush-current Low Dropout Regulator (LDO) circuit that pre-charges the load. The LDO is a circuit that provides an efficient and stable supply voltage, and controlling the inrush current is one of the most important aspects of LDO design: when the inrush current is large, the output voltage of the LDO may become unstable. The method proposed in this paper reduces the inrush current from 438 mA to 5 mA. The designed LDO has been implemented in a 55 nm CMOS process.
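The benefit of pre-charging follows from the capacitor charging relation I ≈ C·dV/dt: if the load capacitor is brought near its target voltage before the main pass device turns on, the remaining voltage step, and hence the current spike, is small. The numbers below are illustrative assumptions, not the paper's design values.

```python
# Back-of-the-envelope inrush estimate: I ≈ C * dV / dt
C_load = 1e-6    # assumed 1 uF output capacitor
t_ramp = 3e-6    # assumed startup ramp time (s)
v_out  = 1.2     # assumed target output voltage (V)

print(C_load * v_out / t_ramp)            # full step at startup: ~0.4 A spike
print(C_load * (0.05 * v_out) / t_ramp)   # pre-charged to 95%: ~20 mA spike
```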
Citations: 0
Scaling Bit-Flexible Neural Networks
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078506
Yun-Nan Chang, Yu-Tang Tin
This paper proposes a neural network training scheme that produces network weights in a fixed-point format such that, under different truncated weight lengths, the network achieves near-optimal inference accuracy at each corresponding word length. Similar ideas have been explored before; the salient feature of our proposed scaling bit-progressive method is that it further takes into account the use and training of a weight scaling factor, which significantly improves inference accuracy. Our experimental results show that our trained ResNet-18 improves the top-1 and top-5 accuracies on the Tiny-ImageNet dataset by an average of 11.02% and 9.21%, respectively, compared with previous work that does not use the scaling factor. At a truncated size of 5 bits, the top-1 and top-5 accuracy losses relative to floating-point weights are only about 0.5% and 0.31%. The proposed method can be applied to neural network accelerators, especially those that support bit-serial processing.
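The role of a scale factor can be seen in a minimal quantization sketch: weights are mapped once to signed fixed-point codes, shorter word lengths are obtained by truncating the same codes, and the scale factor maps them back to real values. This is a generic illustration of scaled fixed-point truncation (here the scale is computed from the weight range rather than trained, as it is in the paper), not the paper's training scheme.

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Signed fixed-point quantization with a per-tensor scale factor."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)   # trained in the paper
    q = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.astype(np.int32), scale

def truncate(q: np.ndarray, from_bits: int, to_bits: int) -> np.ndarray:
    """Keep only the to_bits most significant bits of each from_bits code."""
    shift = from_bits - to_bits
    return (q >> shift) << shift

w = np.float32([0.71, -0.32, 0.05, -0.88])
q8, s = quantize(w, bits=8)
print(q8 * s)                  # 8-bit reconstruction
print(truncate(q8, 8, 5) * s)  # 5-bit view of the same weights, same scale
```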
Citations: 2
A New Scan Chain Reordering Method for Low Power Consumption based on Care Bit Density
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078527
K. Cho, Jihye Kim, Hyunggoy Oh, Sangjun Lee, Sungho Kang
Scan-based testing, though widely used in modern digital designs, consumes more power than functional-mode operation. This excessive power consumption can cause severe hazards such as degraded circuit reliability and yield loss. To solve this problem, a new scan chain reordering method based on care-bit density is proposed in this paper. The proposed method merges care bits toward the front of the scan chains, which reduces scan cell switching activity during scan shift operations. Experimental results on the ISCAS'89 benchmark circuits show that the proposed scan chain reordering method reduces test power consumption compared to previous work.
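The reordering criterion can be made concrete with a toy model: count, for each scan cell, how many test cubes specify a care bit (non-'X') there, and place high-density cells toward the front of the chain so the don't-care tail can be filled with non-switching values. The cubes and the plain density sort below are illustrative assumptions; the paper's method is more elaborate than this sketch.

```python
def reorder_by_care_density(cubes, num_cells):
    """Return a scan-cell order with care-bit-dense cells first."""
    density = [sum(cube[c] != 'X' for cube in cubes) for c in range(num_cells)]
    return sorted(range(num_cells), key=lambda c: -density[c])

# Hypothetical test cubes over 5 scan cells ('X' = don't care)
cubes = ["1XX0X", "XX10X", "0X1XX"]
print(reorder_by_care_density(cubes, 5))   # [0, 2, 3, 1, 4]: cells 1 and 4 move to the tail
```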
Citations: 3
Model-based Parallelization for Simulink Models on Multicore CPUs and GPUs
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078489
Zhaoqian Zhong, M. Edahiro
In this paper, we propose a model-based approach that parallelizes Simulink models on multicore CPUs and NVIDIA GPUs at the block level and generates CUDA C code for parallel execution. In our proposed approach, Simulink models are converted to directed acyclic graphs (DAGs) based on their block diagrams, wherein nodes represent tasks of grouped blocks in the model and edges represent the communication between blocks. Next, a path analysis is conducted on the DAG to extract all execution paths and calculate the length of each path, which comprises the execution times of the tasks and the communication times of the edges on the path. Then, an integer linear programming (ILP) formulation is used to minimize the length of the critical path of the DAG, which represents the execution time of the Simulink model. The ILP formulation also balances the workload on each CPU core for optimized hardware utilization. We evaluate the proposed approach by parallelizing an image processing model on a platform with two homogeneous CPU cores and two GPUs to determine its effectiveness.
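The quantity the ILP minimizes, the DAG's critical path counting both task execution times and edge communication times, can be computed directly for a fixed graph. The sketch below is a plain longest-path evaluation over a hypothetical four-task graph, a reference for what the optimization targets; it is not the paper's ILP formulation.

```python
from collections import defaultdict

def critical_path_length(exec_time, edges):
    """Longest path through a task DAG, where path length is the sum of
    node execution times plus edge communication times along the path."""
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v, comm in edges:
        succ[u].append((v, comm))
        indeg[v] += 1
    order = [t for t in exec_time if indeg[t] == 0]   # Kahn's topological sort
    finish = {t: exec_time[t] for t in order}
    i = 0
    while i < len(order):
        u = order[i]; i += 1
        for v, comm in succ[u]:
            finish[v] = max(finish.get(v, 0), finish[u] + comm + exec_time[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)
    return max(finish.values())

exec_time = {"A": 3, "B": 2, "C": 4, "D": 1}                  # task times
edges = [("A", "B", 1), ("A", "C", 2), ("B", "D", 1), ("C", "D", 1)]
print(critical_path_length(exec_time, edges))   # 11, via A -> C -> D
```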
Citations: 4
Runtime Estimation Model Based Graph Partitioning for Parallel Custom Instruction Selection
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078512
Chenglong Xiao, Shanshan Wang, Wanjun Liu, Haicheng Qu, Xinlin Wang
Custom instruction selection is one of the most computationally difficult problems in custom instruction identification for application-specific instruction-set processors. Most existing research tries to solve the custom instruction selection problem using sequential algorithms on a single compute node. Considering the high complexity of the problem, this paper proposes an efficient parallel method based on multi-depth graph partitioning for selecting custom instructions. Experimental results show that the proposed parallel custom instruction selection method outperforms two of the latest parallel methods and can achieve near-linear speedup.
Citations: 0
AI 32TFLOPS Autonomous Driving Processor on AI-Ware with Adaptive Power Saving
Pub Date: 2019-10-06 DOI: 10.1109/ISOCC47750.2019.9078533
Youngsu Kwon, Yong Cheol Peter Cho, Jeongmin Yang, Jaehoon Chung, Kyoung-Seon Shin, Jinho Han, Chan Kim, C. Lyuh, Hyun-Mi Kim, I. S. Jeon, Minseok Choi
AI processors are extending their application area into mobile and edge devices. Low power consumption, which has always been an essential factor in processor design, is now the most critical factor in making mobile AI processors viable, and the high performance requirement worsens the power problem because of the large area occupied by the many processing engines an AI processor needs. We present the design of an AI processor targeting both CNN and MLP processing in autonomous vehicles. The proposed AI processor integrates a Super-Thread-Core composed of 16384 nano cores in a mesh-grid network for neural network acceleration. The performance of the processor reaches 32 TFLOPS, enabling hyper-real-time execution of CNNs and MLPs. Each nano core is programmable by a sequence of instructions compiled from the neural network description by the proprietary AI-Ware. The mesh array of nano cores at the heart of the neural computation accounts for most of the power consumption. The AI-Ware compiler enables adaptive power gating by dynamically compiling commands based on the temperature profile, reducing total power consumption by 50%.
Citations: 1