Energy-efficient DNN-training with Stretchable DRAM Refresh Controller and Critical-bit Protection
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078532
Duy-Thanh Nguyen, I. Chang
Training a DNN is a time-consuming process that requires intensive memory resources. Many software-based approaches have been proposed to improve the performance and energy efficiency of DNN inference, while training hardware has received comparatively little attention. In this work, we present a novel DRAM architecture with critical-bit protection, targeting both the main memory and the graphics memory of the training system. Evaluated on GEM5-GPGPUsim, the proposed DRAM architecture achieves 23% and 12% DRAM energy reduction with 32-bit floating point on main and graphics memory, respectively. It also improves system performance by 0.43~4.12% while incurring only a negligible accuracy drop in DNN training.
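The intuition behind critical-bit protection can be shown in a few lines (a minimal sketch, not the authors' controller; the weight value is invented): flipping a low-order mantissa bit of an IEEE-754 float32 barely perturbs a weight, while flipping an exponent bit destroys it, so only the high-order bits need reliable refresh.

```c
/* Hypothetical illustration: compare the damage from flipping a mantissa
 * LSB versus an exponent bit in a float32 DNN weight. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static float flip_bit(float v, int bit) /* bit 0 = mantissa LSB, 31 = sign */
{
    uint32_t raw;
    memcpy(&raw, &v, sizeof raw);
    raw ^= (uint32_t)1 << bit;
    memcpy(&v, &raw, sizeof v);
    return v;
}

int main(void)
{
    float w = 0.0123f;               /* an invented, typically small weight */
    printf("%g\n", flip_bit(w, 0));  /* mantissa LSB: still ~0.0123         */
    printf("%g\n", flip_bit(w, 30)); /* exponent MSB: ~1e36, catastrophic   */
    return 0;
}
```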
{"title":"Energy-efficient DNN-training with Stretchable DRAM Refresh Controller and Critical-bit Protection","authors":"Duy-Thanh Nguyen, I. Chang","doi":"10.1109/ISOCC47750.2019.9078532","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078532","url":null,"abstract":"Training DNN is a time-consuming process and requires intensive memory resources. Many software-based approaches were proposed to improve the performance and energy efficiency of inferring DNNs. Meanwhile training hardware is still received limited attention. In this work, we present a novel DRAM architecture with critical-bit protection. Our method targets main memory and graphical memory of the training system. Experimented on GEM5-GPGPUsim, our proposed DRAM architecture can achieve 23% and 12% DRAM energy reduction with floating point 32bit on main and graphical memories, respectively. Also, it further improves system's performance by 0.43˜ 4.12% while maintaining a negligible accuracy drops in training DNNs.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130436044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Function-Level Module Sharing in High-Level Synthesis
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078522
Ryohei Nozaki, Hiroki Nishikawa, Ittetsu Taniguchi, H. Tomiyama
High-Level Synthesis (HLS), which automatically synthesizes an RTL circuit from a behavioral description written in a high-level programming language such as C, has become popular as a way to improve design productivity. However, the area of HLS-generated circuits is often larger than that of human-designed ones. One reason is that HLS tools often generate multiple instances of the same module from a single C function. In this work, we propose a function-level module sharing technique for HLS.
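The area problem is easy to see in a toy C input (a hypothetical fragment, not from the paper): a tool that instantiates one hardware module per call site of `mac()` would emit four multiplier-adder instances below, whereas function-level sharing maps all four calls onto a single instance, multiplexing its operands and trading latency for area.

```c
/* Toy HLS input: four call sites of mac() that could share one module. */
#include <stdio.h>

static int mac(int a, int b, int acc) { return acc + a * b; }

static int fir4(const int x[4], const int h[4])
{
    int acc = 0;
    acc = mac(x[0], h[0], acc);   /* call site 1 */
    acc = mac(x[1], h[1], acc);   /* call site 2 */
    acc = mac(x[2], h[2], acc);   /* call site 3 */
    acc = mac(x[3], h[3], acc);   /* call site 4: same module, reused */
    return acc;
}

int main(void)
{
    const int x[4] = {1, 2, 3, 4}, h[4] = {4, 3, 2, 1};
    printf("%d\n", fir4(x, h));   /* prints 20 */
    return 0;
}
```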
{"title":"Function-Level Module Sharing in High-Level Synthesis","authors":"Ryohei Nozaki, Hiroki Nishikawa, Ittetsu Taniguchi, H. Tomiyama","doi":"10.1109/ISOCC47750.2019.9078522","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078522","url":null,"abstract":"High-Level Synthesis (HLS) which automatically synthesizes an RTL circuit from a behavioral description written in a high-level programming language such as C has now become popular in order to improve the design productivity. However, the area of HLS-generated circuits is often larger than that of human-designed ones. One of the reasons is that HLS tools often generate multiple instances of the same module from a C function. In this work, we propose a function-level module sharing technique in HLS.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130314130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Throughput Improvement of an Autocorrelation Block for Time Synchronization in OFDM-based LiFi
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078499
Erwin Setiawan, T. Adiono
In this paper, the throughput improvement of an autocorrelation block is presented. The improvement is achieved by employing a pipelined architecture. The autocorrelation block is used for time synchronization in an OFDM-based Visible Light Communication (VLC) system, where it estimates the coarse time offset of the received OFDM data symbol. Adding pipeline registers divides the critical path of the combinational circuit into shorter segments, so the clock frequency can be increased. We show a threefold throughput improvement using a four-stage pipeline; the maximum clock frequency of the block is 188 MHz. The block has been implemented and verified on a Xilinx Zynq-7000 FPGA.
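For reference, the operation being pipelined looks roughly like the following behavioral model (a sketch under assumed parameters, not the authors' RTL): the received signal is correlated with itself at a lag of L samples, and in hardware a register after the multiplier and between adder stages is what cuts the combinational path.

```c
/* Behavioral sketch of lag-L autocorrelation for coarse OFDM timing.
 * N and L are invented; the comments mark where pipeline registers
 * would sit in a four-stage hardware implementation. */
#include <stdio.h>

#define N 16   /* correlation window length (assumed) */
#define L 8    /* preamble repetition lag (assumed)   */

static float autocorr(const float re[], const float im[], int d)
{
    float acc = 0.0f;
    for (int m = 0; m < N; m++) {
        /* stage 1: real part of conj(r[d+m]) * r[d+m+L]  (register here) */
        float pr = re[d + m] * re[d + m + L] + im[d + m] * im[d + m + L];
        /* stages 2-4: pipelined accumulation (registers between adders) */
        acc += pr;
    }
    return acc;   /* the peak over d marks the coarse symbol start */
}

int main(void)
{
    float re[N + L + 8] = {0}, im[N + L + 8] = {0};
    for (int i = 0; i < N + L + 8; i++)
        re[i] = (i % L) ? 0.5f : 1.0f;        /* toy L-periodic preamble */
    printf("metric at d=0: %f\n", autocorr(re, im, 0));
    return 0;
}
```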
{"title":"Throughput Improvement of an Autocorrelation Block for Time Synchronization in OFDM-based LiFi","authors":"Erwin Setiawan, T. Adiono","doi":"10.1109/ISOCC47750.2019.9078499","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078499","url":null,"abstract":"In this paper, the throughput improvement of an autocorrelation block is presented. The improvement is carried out by employing pipeline architecture. The autocorrelation block is used for time synchronization in OFDM-based Visible Light Communication (VLC) system. The autocorrelation block estimates the coarse time offset of the received OFDM data symbol. By adding pipeline registers, we can reduce the critical path of the combinational circuits by dividing it into smaller critical path, therefore the clock frequency can be increased. We show three times throughput improvement by using four stages pipeline architecture. The maximum clock frequency of the block is 188 MHz. The block has been implemented and verified on Xilinx Zynq-7000 FPGA.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132700009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maximum Error-Aware Design of Approximate Array Multipliers
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078488
Kenta Shirane, Takahiro Yamamoto, Ittetsu Taniguchi, Yuko Hara-Azumi, S. Yamashita, H. Tomiyama
Approximate computing is considered a promising approach to the design of area-, power-, or performance-efficient circuits. A common approach to approximate computing is to replace an exact arithmetic circuit with an approximate one. In this paper, we propose a methodology for systematically designing a series of approximate array multipliers with different accuracy, area, power, and delay, from which a circuit designer can select the one that satisfies the accuracy requirement. Experimental results show the effectiveness of our approximate multipliers against existing approximate multipliers.
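One common array-multiplier approximation, shown below as a sketch (an illustration of the general technique, not the authors' exact designs), is to drop the least significant partial-product columns: each removed column deletes adder cells (area, power, delay), and the maximum error it can introduce is known at design time, which is what makes a maximum-error-aware series of designs possible.

```c
/* 8x8 array multiplier with the k low partial-product columns removed. */
#include <stdio.h>
#include <stdint.h>

static uint32_t approx_mul8(uint8_t a, uint8_t b, int k)
{
    uint32_t mask = ~(((uint32_t)1 << k) - 1); /* zero columns below k   */
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)                /* one partial product/row */
        if (b & (1u << i))
            sum += ((uint32_t)a << i) & mask;
    return sum;
}

int main(void)
{
    printf("exact       : %d\n", 200 * 93);                /* 18600 */
    printf("approx (k=4): %u\n", approx_mul8(200, 93, 4)); /* 18592 */
    return 0;
}
```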
{"title":"Maximum Error-Aware Design of Approximate Array Multipliers","authors":"Kenta Shirane, Takahiro Yamamoto, Ittetsu Taniguchi, Yuko Hara-Azumi, S. Yamashita, H. Tomiyama","doi":"10.1109/ISOCC47750.2019.9078488","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078488","url":null,"abstract":"Approximate computing is considered as a processing approach to design of area-, power- or performance-efficient circuits. Approaches to approximate computing are to replace an exact arithmetic circuit with an approximate circuit. In this paper, we propose a methodology to systematically design a series of approximate array multipliers with different accuracy, area, power and delay. A circuit designer can select the one which satisfies the requirement on accuracy. Experimental results show the effectiveness of our approximate multipliers against existing approximate multipliers.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"402 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133137328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Design of Low Inrush Current Low Dropout Regulator Using the Method of Pre-charging the Load
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078498
Yong-Deok Ahn, Sungjae Oh, Sungjin Kim, Kangyoon Lee
This paper proposes a low-inrush-current low-dropout (LDO) regulator that pre-charges the load. An LDO regulator provides an efficient and stable supply voltage, and controlling the inrush current is one of the most important aspects of its design: when the inrush current is large, the output voltage may become unstable. The method proposed in this paper reduces the inrush current from 438 mA to 5 mA. The designed LDO regulator has been implemented in a 55 nm CMOS process.
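A back-of-the-envelope model shows why pre-charging helps (all component values below are assumptions for illustration, not figures from the paper): the start-up spike is roughly C_out x dV/dt while the output capacitor charges, so pre-charging the load to near its target voltage shrinks dV and collapses the inrush current.

```c
/* Inrush estimate I = C_out * dV / t_ramp, with invented values. */
#include <stdio.h>

int main(void)
{
    double c_out  = 1e-6;   /* assumed 1 uF output/load capacitance */
    double t_ramp = 4e-6;   /* assumed 4 us start-up ramp           */
    double vdd    = 1.8;    /* assumed target output voltage        */
    double v_pre  = 1.78;   /* assumed pre-charge level             */

    printf("no pre-charge: %.0f mA\n", c_out * vdd / t_ramp * 1e3);
    printf("pre-charged:   %.0f mA\n", c_out * (vdd - v_pre) / t_ramp * 1e3);
    return 0;
}
```

With these invented numbers the spike drops from 450 mA to 5 mA, the same order of magnitude as the 438 mA to 5 mA reduction the paper reports.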
{"title":"A Design of Low Inrush Current Low Dropout Regulator Using the Method of Pre-charging the Load","authors":"Yong-Deok Ahn, Sungjae Oh, Sungjin Kim, Kangyoon Lee","doi":"10.1109/ISOCC47750.2019.9078498","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078498","url":null,"abstract":"This paper proposes a Low Inrush Current Low Dropout Regulator circuit using the method of pre-charging the load. The Low Dropout Regulator is a circuit to provide an efficient and stable supply voltage. One of the most important things of Low Dropout Regulator design is controlling the inrush current. The output voltage of the Low Dropout Regulator may be unstable when the inrush current is Large. The method proposed in this paper reduced inrush current from 438 mA to 5 mA. The designed Low Dropout Regulator has been implemented 55nm CMOS fabrication.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132997239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scaling Bit-Flexible Neural Networks
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078506
Yun-Nan Chang, Yu-Tang Tin
This paper proposes a neural network training scheme that produces network weights in a fixed-point format such that, under different truncated weight lengths, the network achieves near-optimal inference accuracy at each corresponding word length. A similar idea has been explored before; the salient feature of our scaling bit-progressive method, however, is that it also introduces and trains a weight scaling factor, which significantly improves inference accuracy. Experimental results show that our trained ResNet-18 improves the top-1 and top-5 accuracies on the Tiny-ImageNet dataset by 11.02% and 9.21% on average, respectively, compared with previous work that does not use the scaling factor. At a truncated size of 5 bits, the top-1 and top-5 accuracy losses relative to floating-point weights are only about 0.5% and 0.31%. The proposed method can be applied to neural network accelerators, especially those that support bit-serial processing.
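The mechanics of bit-progressive weights can be sketched as follows (a conceptual illustration with assumed details; in the paper the scale would be learned during training rather than fixed as here): each weight is stored as a fixed-point code plus a scaling factor s, so truncating the code to fewer MSBs still dequantizes to a sensible value at every word length.

```c
/* Quantize a weight to an 8-bit code with scale s, then dequantize it
 * after truncating to progressively fewer bits. */
#include <stdio.h>
#include <math.h>

#define FULL_BITS 8

static int quantize(float w, float s)        /* code in [-128, 127] */
{
    int q = (int)lrintf(w / s * (1 << (FULL_BITS - 1)));
    if (q > 127)  q = 127;
    if (q < -128) q = -128;
    return q;
}

static float dequant_truncated(int q, float s, int bits)
{
    int drop = FULL_BITS - bits;
    int qt = (q >> drop) << drop;            /* keep only `bits` MSBs */
    return (float)qt / (1 << (FULL_BITS - 1)) * s;
}

int main(void)
{
    float w = 0.3172f, s = 1.0f;             /* s invented; learned in paper */
    int q = quantize(w, s);
    for (int b = 8; b >= 5; b--)
        printf("%d-bit: %+.4f\n", b, dequant_truncated(q, s, b));
    return 0;
}
```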
{"title":"Scaling Bit-Flexible Neural Networks","authors":"Yun-Nan Chang, Yu-Tang Tin","doi":"10.1109/ISOCC47750.2019.9078506","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078506","url":null,"abstract":"This paper proposes a neural network training scheme in order to obtain the network weights represented in the fixed-point number format such that under the different truncated lengths of the weights, our neural new network can all achieve near-optimized inference accuracy at the corresponding word-length. The similar idea has been explored before; however, the salient feature of our proposed scaling bit-progressive method is we have further taken into account the use and training of weight scaling factor, which can significant improve the inference accuracy. Our experimental results show that our trained Resnet- 18 neural network can improve the top-1 and top-5 accuracies of Tiny-ImageNet dataset by the average of 11.02% and 9.21% compared with the previous work without using the scaling factor. The top-1 and top-5 accuracy losses compared with float-point weights are only about 0.5% and 0.31% under the truncated size of 5-bit. The proposed method can be applied for neural network accelerators especially for those which support bit-serial processing.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116072058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Scan Chain Reordering Method for Low Power Consumption based on Care Bit Density
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078527
K. Cho, Jihye Kim, Hyunggoy Oh, Sangjun Lee, Sungho Kang
Scan-based testing, though widely used in modern digital designs, consumes more power than functional-mode operation. This excessive power consumption can cause severe hazards such as circuit reliability degradation and yield loss. To address this problem, a new scan chain reordering method based on care-bit density is proposed in this paper. The proposed method merges care bits toward the front end of scan chains, thereby reducing scan cell switching activity during scan shift operations. Experimental results on the ISCAS'89 benchmark circuits show that the proposed scan chain reordering method reduces test power consumption compared to previous work.
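The core heuristic, as we read it, could look like the following sketch (the care-bit matrix is invented and the ranking rule is an assumption about the method, not the paper's exact algorithm): rank scan cells by how often they hold a care bit across the test set, and place the densest cells nearest the scan input so that don't-care filling can keep the rest of the chain quiet while shifting.

```c
/* Reorder scan cells by care-bit density (descending). */
#include <stdio.h>
#include <stdlib.h>

#define CELLS 6
#define PATTERNS 4

/* 1 = care bit, 0 = don't-care, indexed [pattern][cell] (invented data) */
static const int care[PATTERNS][CELLS] = {
    {1,0,0,1,0,0}, {1,0,1,1,0,0}, {0,0,0,1,0,1}, {1,0,0,1,0,0},
};

static int density[CELLS];

static int by_density(const void *a, const void *b)
{
    return density[*(const int *)b] - density[*(const int *)a];
}

int main(void)
{
    int order[CELLS];
    for (int c = 0; c < CELLS; c++) {
        order[c] = c;
        for (int p = 0; p < PATTERNS; p++)
            density[c] += care[p][c];
    }
    qsort(order, CELLS, sizeof order[0], by_density);
    printf("new chain order (scan-in first):");
    for (int c = 0; c < CELLS; c++)
        printf(" %d", order[c]);   /* densest cells 3 and 0 come first */
    printf("\n");
    return 0;
}
```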
{"title":"A New Scan Chain Reordering Method for Low Power Consumption based on Care Bit Density","authors":"K. Cho, Jihye Kim, Hyunggoy Oh, Sangjun Lee, Sungho Kang","doi":"10.1109/ISOCC47750.2019.9078527","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078527","url":null,"abstract":"Scan-based testing, though widely used in modern digital designs, causes more power consumption than in functional mode. This excessive power consumption can cause severe hazards such as circuit reliability and yield loss. To solve this problem, a new scan chain reordering method based on care bit density is proposed in this paper. This proposed method helps merging care bits toward the front end of scan chains. Thus, it can reduce scan cell switching activities during scan shift operation. Experimental results on ISCAS’89 benchmark circuits show that the proposed scan chain reordering method reduces test power consumption compared to the previous work.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124674717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Model-based Parallelization for Simulink Models on Multicore CPUs and GPUs
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078489
Zhaoqian Zhong, M. Edahiro
In this paper, we propose a model-based approach that parallelizes Simulink models on multicore CPUs and NVIDIA GPUs at the block level and generates CUDA C code for parallel execution. In our approach, a Simulink model is converted into a directed acyclic graph (DAG) based on its block diagram, in which nodes represent tasks of grouped blocks and edges represent the communication between blocks. Next, a path analysis extracts all execution paths in the DAG and calculates the length of each path, comprising the execution times of its tasks and the communication times of its edges. An integer linear programming (ILP) formulation then minimizes the length of the critical path of the DAG, which represents the execution time of the Simulink model; the formulation also balances the workload across CPU cores for optimized hardware utilization. We evaluate the proposed approach by parallelizing an image processing model on a platform with two homogeneous CPU cores and two GPUs to demonstrate its effectiveness.
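The quantity the ILP minimizes is the critical-path length of the task DAG: task execution times plus edge communication times along the longest path. The sketch below computes it for a tiny invented DAG (nodes already in topological order), just to make the objective concrete; the paper's formulation additionally makes these times depend on the CPU/GPU assignment.

```c
/* Critical-path length of a 5-task DAG: exec times on nodes,
 * communication times on edges (all numbers invented). */
#include <stdio.h>

#define N 5

static const int exec[N] = {3, 4, 2, 6, 1};
static const int comm[N][N] = {          /* -1 means no edge */
    {-1,  2,  1, -1, -1},
    {-1, -1, -1,  3, -1},
    {-1, -1, -1,  1, -1},
    {-1, -1, -1, -1,  2},
    {-1, -1, -1, -1, -1},
};

int main(void)
{
    int finish[N] = {0};                 /* earliest finish per task */
    for (int v = 0; v < N; v++) {        /* indices are topological  */
        finish[v] += exec[v];
        for (int w = v + 1; w < N; w++)
            if (comm[v][w] >= 0 && finish[v] + comm[v][w] > finish[w])
                finish[w] = finish[v] + comm[v][w];
    }
    /* path 0 ->1 ->3 ->4: 3+2+4+3+6+2+1 = 21 */
    printf("critical path length: %d\n", finish[N - 1]);
    return 0;
}
```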
{"title":"Model-based Parallelization for Simulink Models on Multicore CPUs and GPUs","authors":"Zhaoqian Zhong, M. Edahiro","doi":"10.1109/ISOCC47750.2019.9078489","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078489","url":null,"abstract":"In this paper, we propose a model-based approach to parallelize Simulink models on multicore CPUs and NVIDIA GPUs at the block level and generate CUDA C codes for parallel execution. In our proposed approach, the Simulink models are converted to directed acyclic graphs (DAGs) based on their block diagrams, wherein the nodes represent tasks of grouped blocks in the model and the edges represent the communication behaviors between blocks. Next, a path analysis is conducted on the DAGs to extract all execution paths and calculate the length of each path, which comprises the execution times of tasks and the communication times of edges on the path. Then, an integer linear programming (ILP) formulation is used to minimize the length of the critical path of the DAG, which represents the execution time of the Simulink model. The ILP formulation also balances the workloads on each CPU core for optimized hardware utilization. We evaluate the proposed approach by parallelizing an image processing model on a platform of two homogeneous CPU cores and two GPUs to determine its effectiveness.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130177667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Runtime Estimation Model Based Graph Partitioning for Parallel Custom Instruction Selection
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078512
Chenglong Xiao, Shanshan Wang, Wanjun Liu, Haicheng Qu, Xinlin Wang
Custom instruction selection is one of the most computationally difficult problems in custom instruction identification for application-specific instruction-set processors. Most existing research tries to solve the custom instruction selection problem with sequential algorithms on a single compute node. Given the high complexity of the problem, this paper proposes an efficient parallel method based on multi-depth graph partitioning for custom instruction selection. Experimental results show that the proposed parallel custom instruction selection method outperforms two of the latest parallel methods and achieves near-linear speedup.
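To see why the problem is hard and why partitioning parallelizes it, here is the selection subproblem in miniature (all candidates, node sets, and gains are invented): choose a subset of candidate instructions, no two covering the same dataflow node, that maximizes estimated speedup. The subset space grows as 2^n, but after graph partitioning each partition yields an independent instance like this one that can run on its own core.

```c
/* Exhaustive selection over 4 candidate instructions: each candidate
 * covers a bitmask of graph nodes and has an estimated gain. */
#include <stdio.h>

#define CANDS 4

static const unsigned nodes[CANDS] = {0x3, 0xC, 0x6, 0x30}; /* node masks */
static const int gain[CANDS]       = {5,   4,   7,   2};    /* est. gains */

int main(void)
{
    int best_gain = 0;
    unsigned best_set = 0;
    for (unsigned s = 0; s < 1u << CANDS; s++) {  /* all 2^4 subsets */
        unsigned used = 0;
        int g = 0, ok = 1;
        for (int i = 0; i < CANDS && ok; i++)
            if (s & (1u << i)) {
                if (used & nodes[i]) ok = 0;      /* overlap: invalid */
                else { used |= nodes[i]; g += gain[i]; }
            }
        if (ok && g > best_gain) { best_gain = g; best_set = s; }
    }
    printf("best subset mask 0x%x, gain %d\n", best_set, best_gain);
    return 0;   /* prints mask 0xb (candidates 0, 1, 3), gain 11 */
}
```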
{"title":"Runtime Estimation Model Based Graph Partitioning for Parallel Custom Instruction Selection","authors":"Chenglong Xiao, Shanshan Wang, Wanjun Liu, Haicheng Qu, Xinlin Wang","doi":"10.1109/ISOCC47750.2019.9078512","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078512","url":null,"abstract":"Custom instruction selection is one of the most com- putationally difficult problems involved in the custom instruction identification for application-specific instruction-set processors. Most of existing research try to solve the custom instruction selection problem using sequential algorithms on a single compute node. Considering the high complexity of the problem, this paper proposes an efficient parallel method based on multi-depth graph partitioning for selecting custom instruction. Experimental result- s show that the proposed parallel custom instruction selection method outperforms two of the latest parallel methods and can achieve near-linear speedup.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132567592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AI 32TFLOPS Autonomous Driving Processor on AI-Ware with Adaptive Power Saving
Pub Date: 2019-10-06 | DOI: 10.1109/ISOCC47750.2019.9078533
Youngsu Kwon, Yong Cheol Peter Cho, Jeongmin Yang, Jaehoon Chung, Kyoung-Seon Shin, Jinho Han, Chan Kim, C. Lyuh, Hyun-Mi Kim, I. S. Jeon, Minseok Choi
AI processors are extending their application area into mobile and edge devices. Low power consumption, which has always been an essential factor in processor design, is now the most critical factor for mobile AI processors to be viable, and the high performance requirement exacerbates the energy problem because implementing an AI processor demands a large number of processing engines and hence a large area. We present the design of an AI processor targeting both CNN and MLP processing in autonomous vehicles. The proposed AI processor integrates a Super-Thread-Core composed of 16,384 nano cores in a mesh-grid network for neural network acceleration. The performance of the processor reaches 32 TFLOPS, enabling hyper-real-time execution of CNNs and MLPs. Each nano core is programmable through a sequence of instructions compiled from the neural network description by the proprietary AI-Ware. The mesh array of nano cores at the heart of the neural computation accounts for most of the power consumption; the AI-Ware compiler enables adaptive power gating by dynamically compiling commands based on the temperature profile, reducing total power consumption by 50%.
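Schematically, temperature-driven power gating is a per-region decision over the mesh (the grid size, temperatures, and threshold below are invented; the paper does not publish AI-Ware's policy): regions that run hot are gated off when commands are compiled, and their work is scheduled onto cooler regions.

```c
/* Toy temperature-profile gating decision over mesh regions. */
#include <stdio.h>

#define REGIONS 8

int main(void)
{
    const float temp[REGIONS] = {61.2f, 72.5f, 84.1f, 90.3f,
                                 65.0f, 88.7f, 70.1f, 62.4f};
    const float gate_above = 85.0f;   /* assumed thermal threshold (deg C) */

    for (int r = 0; r < REGIONS; r++)
        printf("region %d (%.1f C): %s\n", r, temp[r],
               temp[r] > gate_above ? "power-gated" : "active");
    return 0;
}
```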
{"title":"AI 32TFLOPS Autonomous Driving Processor on AI-Ware with Adaptive Power Saving","authors":"Youngsu Kwon, Yong Cheol Peter Cho, Jeongmin Yang, Jaehoon Chung, Kyoung-Seon Shin, Jinho Han, Chan Kim, C. Lyuh, Hyun-Mi Kim, I. S. Jeon, Minseok Choi","doi":"10.1109/ISOCC47750.2019.9078533","DOIUrl":"https://doi.org/10.1109/ISOCC47750.2019.9078533","url":null,"abstract":"AI processors are extending the application area into mobile and edge devices. The requirement of low power consumption which has been an essential factor in designing processors is now becoming the most critical factor for mobile AI processors to be viable. The high performance requirement exacerbates the energy crisis caused by a large area due to a lot of processing engines required for implementing AI processors. We present the design of an AI processor targeting both CNN and MLP processing in autonomous vehicles. The proposed AI processor integrates Super-Thread-Core composed of 16384 nano cores in mesh-grid network for neural network acceleration. The performance of the processor reaches 32 Tera FLOPS enabling hyper real-time execution of CNN and MLP. Each nano core is programmable by a sequence of instructions compiled from the neural network description by the proprietary AI-Ware. The mesh-array of nano cores at the heart of neural computing accounts for most of the power consumption. The AI-ware compiler enables adaptive power gating by dynamically compiling the commands based on the temperature profile reducing 50% of total power consumption.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130730122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}