Scalar replacement is an effective technique for improving the performance of RTL code generated by high-level synthesis (HLS) from C programs with intensive array accesses. In scalar replacement, data read from arrays are stored in shift registers, and later accesses to the same data are served from the shift registers instead of the arrays. Since arrays in C programs are usually mapped to RAMs with a limited number of ports, reducing array accesses reduces memory accesses, which in turn improves the performance of the resulting RTL code. In real-life C programs, shift registers sometimes must be initialized conditionally using multiple array accesses, which increases the number of array accesses in the main loops. To reduce these conditional array accesses in the main loops, the previous scalar replacement method proposed the use of a loop transformation called loop peeling. Loop peeling significantly increases code size, which negatively affects the performance or circuit area of the synthesized hardware. In this paper, we propose a new method to initialize shift registers without loop peeling. The proposed method works as a preprocessing step on the input C program prior to scalar replacement. Experimental results demonstrate that the proposed method reduces the number of execution cycles of the synthesized hardware compared to the previous method.
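The basic transformation described in this abstract can be illustrated with a minimal C sketch. It shows plain scalar replacement on a hypothetical three-point sum, not the paper's initialization method: the function names and the kernel are assumptions chosen for illustration, and the shift register here is initialized unconditionally before the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: three array reads per iteration (a[i], a[i+1], a[i+2]). */
void sum3_baseline(const int *a, int *out, size_t n)
{
    assert(n >= 3);
    for (size_t i = 0; i + 2 < n; i++)
        out[i] = a[i] + a[i + 1] + a[i + 2];
}

/* Scalar-replaced: a two-stage shift register (r0, r1) holds the values
 * reused from earlier iterations, so the loop body reads the array once. */
void sum3_scalar_replaced(const int *a, int *out, size_t n)
{
    assert(n >= 3);
    int r0 = a[0];              /* shift-register initialization before the loop */
    int r1 = a[1];
    for (size_t i = 0; i + 2 < n; i++) {
        int in = a[i + 2];      /* the only array read inside the loop */
        out[i] = r0 + r1 + in;
        r0 = r1;                /* shift the register chain by one stage */
        r1 = in;
    }
}
```

Both functions write `n - 2` outputs and produce identical results; the second one needs a single RAM read per iteration, which is the effect the abstract attributes to scalar replacement.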
{"title":"Shift Register Initialization in Scalar Replacement for Reducing Code Size","authors":"Kenshu Seto","doi":"10.2197/ipsjtsldm.13.2","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.13.2","url":null,"abstract":"Scalar replacement is an effective technique to improve the performance of the RTL code generated by high-level synthesis (HLS) from C programs with intensive array accesses. In scalar replacement, data accessed from arrays are stored into shift registers, and later array accesses on the same data are replaced with the accesses to the shift registers instead of the arrays. Namely, scalar replacement replaces array accesses with shift register accesses. Since arrays in C programs are usually mapped to RAMs with limited numbers of ports, reducing array accesses with scalar replacement leads to the memory access reduction, which in turn improves the performance of the resulting RTL code. In real-life C programs, sometimes, shift registers must be initialized conditionally using multiple array accesses, which increases the number of array accesses in main loops. To reduce the conditional array access in the main loops, the previous scalar replacement method proposed the use of a loop transformation called loop peeling. Loop peeling brings significant increase in code size, leading to the negative impacts on performance or circuit area of the synthesized hardware. In this paper, we propose a new method to initialize shift registers without loop peeling. The proposed method works as a preprocessing of the input C program prior to scalar replacement. 
With experimental results, we demonstrate the proposed method reduces the numbers of execution cycles of the synthesized hardware compared to the previous method.","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79279120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose a logic optimization method to remove redundancy in a circuit. The incremental Automatic Test Pattern Generation (ATPG) method is used to find redundant multiple faults. To remove as many redundancies as possible, instead of removing redundant single faults first, we eliminate redundant faults from higher cardinality to lower cardinality. Experiments show that the proposed method eliminates more redundancies than the redundancy removal command in the synthesis tool SIS.
{"title":"A Logic Optimization Method by Eliminating Redundant Multiple Faults from Higher to Lower Cardinality","authors":"P. Wang, A. M. Gharehbaghi, M. Fujita","doi":"10.2197/ipsjtsldm.13.35","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.13.35","url":null,"abstract":"In this paper, we propose a logic optimization method to remove the redundancy in the circuit. The incremental Automatic Test Pattern Generation method is used to find the redundant multiple faults. In order to remove as many redundancies as possible, instead of removing the redundant single faults first, we clear up the redundant faults from higher cardinality to lower cardinality. The experiments prove that the proposed method can successfully eliminate more redundancies comparing to the redundancy removal command in the synthesis tool SIS.","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81283440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the progress of semiconductor process miniaturization, delay degradation due to aging increases and threatens the reliability of fabricated chips. The amount of delay degradation is known to be circuit- and workload-dependent, but previous evaluations are based on simulations, and delay degradation measurement of a real circuit under a realistic workload has not yet been reported. This paper proposes a real-circuit delay measurement method that achieves sufficient accuracy to measure circuit- and workload-dependent delay degradation. In the proposed method, an on-chip oscillator supplies a fine-resolution, variable-frequency clock to the internal circuit. The internal circuit executes a test pattern that activates critical paths at various frequencies, and the maximum frequency at which correct results are obtained is determined. This maximum frequency corresponds to the delay of the critical paths activated by the test pattern. Clock multiplication improves delay resolution, and repeated measurement reduces the measurement error caused by time-dependent random delay variation. The proposed method has been implemented on a 65 nm low-power process test chip. The variable-frequency oscillator uses only standard cells and is designed with an automatic layout flow without any timing tuning. The area overhead of the proposed method is 0.09% of the total random logic. The evaluation results show that an average measurement accuracy of 0.18% has been achieved.
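The measurement loop described in this abstract (sweep the clock, keep the fastest setting that still gives correct results, repeat to suppress random variation) can be sketched in software as follows. This is a simulation-only illustration under stated assumptions: `run_test_fn`, `model_circuit`, the picosecond units, and the use of a plain average over repeated runs are all hypothetical and do not come from the paper.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for running the test pattern at a given clock period. */
typedef bool (*run_test_fn)(unsigned period_ps, void *ctx);

/* Shorten the clock period in steps of step_ps and return the shortest period
 * at which the test still passes (the maximum frequency), or 0 if the test
 * already fails at the starting period. */
unsigned min_passing_period_ps(unsigned start_ps, unsigned stop_ps,
                               unsigned step_ps, run_test_fn run, void *ctx)
{
    assert(step_ps > 0 && start_ps >= stop_ps);
    unsigned best = 0;
    for (unsigned p = start_ps; p >= stop_ps; p -= step_ps) {
        if (!run(p, ctx))
            break;                      /* first failing period ends the sweep */
        best = p;
        if (p - stop_ps < step_ps)
            break;                      /* next step would pass stop_ps (or wrap) */
    }
    return best;
}

/* Assumption: repeated measurements are combined with a simple average. */
unsigned averaged_period_ps(unsigned start_ps, unsigned stop_ps, unsigned step_ps,
                            run_test_fn run, void *ctx, unsigned repeats)
{
    assert(repeats > 0);
    unsigned long sum = 0;
    for (unsigned r = 0; r < repeats; r++)
        sum += min_passing_period_ps(start_ps, stop_ps, step_ps, run, ctx);
    return (unsigned)(sum / repeats);
}

/* Toy circuit model: the test passes when the period covers the path delay. */
bool model_circuit(unsigned period_ps, void *ctx)
{
    unsigned delay_ps = *(const unsigned *)ctx;
    return period_ps >= delay_ps;
}
```

In this model the returned period overestimates the true delay by less than one step, which is why a finer clock step gives a finer delay resolution.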
{"title":"Real Circuit Delay Measurement Method by Variable Frequency Operation with On-Chip Fine Resolution Oscillator","authors":"K. Shimamura, Naohiro Ikeda","doi":"10.2197/ipsjtsldm.13.21","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.13.21","url":null,"abstract":"With the progress of semiconductor process miniaturization, delay degradation by aging increases and threatens the reliability of fabricated chips. The amount of delay degradation is known to be circuit and workload dependent, but previous evaluations are based on simulations, and delay degradation measurement of real circuit under realistic workload has not been reported yet. This paper proposes real circuit delay measurement method, which achieves enough accuracy to measure circuit and workload dependent delay degradation. In the proposed method, onchip oscillator supplies fine resolution variable frequency clock to internal circuit. Internal circuit execute test pattern to activate critical paths at various frequency and determine the maximum frequency at which correct results can be obtained. The maximum frequency corresponds to the delay of the critical paths activated by the test pattern. Clock multiplication improves delay resolution, and repetitive measurement reduces measurement error caused by time dependent random delay variation. The proposed method has been implemented on a 65 nm low power process test chip. Variable frequency oscillator utilizes only standard cells and is designed with automatic layout flow without any timing tuning. The area overhead of the proposed method is 0.09% of the total random logic. 
The evaluation result show that 0.18% average measurement accuracy has been achieved.","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73915270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Koichi Fujiwara, Kazushi Kawamura, M. Yanagisawa, N. Togawa
{"title":"An FPGA Implementation Method based on Distributed-register Architectures","authors":"Koichi Fujiwara, Kazushi Kawamura, M. Yanagisawa, N. Togawa","doi":"10.2197/ipsjtsldm.12.38","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.12.38","url":null,"abstract":"","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86945903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Circuit Techniques for Device-Circuit Interaction toward Minimum Energy Operation","authors":"A. Islam, H. Onodera","doi":"10.2197/ipsjtsldm.12.2","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.12.2","url":null,"abstract":"","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87699381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Salita Sombatsiri, S. Shibata, Yuki Kobayashi, Hiroaki Inoue, Takashi Takenaka, T. Hosomi, Jaehoon Yu, Yoshinori Takeuchi
This paper proposes a convolution core for sparse CNNs that can flexibly alternate the parallelism scheme and degree, exploiting intra- and inter-output parallelism of the convolutional layer, and that leverages weight sparsity using a compressed sparse model in the compressed sparse column format with an output-stationary dataflow. The experimental results show that performance improves by 3.9 times even in deeper layers, where the conventional accelerator could not fully exploit parallelism due to the small layer size. The proposed architecture can also exploit weight sparsity. By combining multi-parallelism and weight sparsity, the proposed architecture achieved 5.2 times better performance than the conventional accelerator.
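For readers unfamiliar with the compressed sparse column (CSC) format mentioned in this abstract, the following C sketch shows the generic storage layout and a matrix-vector product that touches only stored nonzero weights. It is a textbook CSC kernel, not the paper's convolution core or its output-stationary dataflow; the struct and function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Generic CSC matrix: only nonzero weights are stored. */
typedef struct {
    size_t rows, cols;
    const size_t *col_ptr;  /* cols + 1 entries; column j spans [col_ptr[j], col_ptr[j+1]) */
    const size_t *row_idx;  /* row index of each stored nonzero */
    const int *val;         /* value of each stored nonzero */
} csc_matrix;

/* y = W * x, visiting only the stored nonzeros of W. */
void csc_matvec(const csc_matrix *w, const int *x, int *y)
{
    for (size_t i = 0; i < w->rows; i++)
        y[i] = 0;
    for (size_t j = 0; j < w->cols; j++) {
        for (size_t k = w->col_ptr[j]; k < w->col_ptr[j + 1]; k++) {
            assert(w->row_idx[k] < w->rows);
            y[w->row_idx[k]] += w->val[k] * x[j];   /* zero weights are never visited */
        }
    }
}
```

The number of multiply-accumulate operations equals the number of stored nonzeros, which is how a sparse model reduces work relative to a dense one.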
{"title":"Parallelism-flexible Convolution Core for Sparse Convolutional Neural Networks on FPGA","authors":"Salita Sombatsiri, S. Shibata, Yuki Kobayashi, Hiroaki Inoue, Takashi Takenaka, T. Hosomi, Jaehoon Yu, Yoshinori Takeuchi","doi":"10.2197/ipsjtsldm.12.22","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.12.22","url":null,"abstract":"This paper proposes a convolution core for sparse CNN that is capable of flexibly alternating the parallelism schemes and degree exploiting intra- and inter-output parallelism of the convolutional layer, and leveraging weight sparsity using a compressed sparse model in the compressed sparse column format and output-stationary dataflow. The experimental results show that the performance is improved by 3.9 times even in the deeper layer where the conventional accelerator could not fully exploit the parallelism due to the small layer size. The proposed architecture could also exploit the weight sparsity. Then, by combining both the multi-parallelism and the weight sparsity, the proposed architecture achieved 5.2 times better performance than the conventional accelerator.","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78625225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalar replacement is an effective array access optimization that can be applied before high-level synthesis (HLS). Successful application of scalar replacement removes local memories and, as a result, decreases hardware area. In addition, scalar replacement reduces the number of hardware execution cycles by reducing memory access conflicts. In scalar replacement, shift registers are introduced to remove local arrays, and reuse distances correspond to the lengths of the shift registers. Previous scalar replacement methods implement the shift registers as chains of registers, so the hardware area becomes large when the reuse distances are large. In addition, when reuse distances are unknown at compile time, previous scalar replacement methods require multiplexers with large numbers of inputs, which further increases hardware area. In this paper, we propose a new technique to resolve these issues. In particular, we implement the shift registers with circular buffers instead of chains of registers. Large shift registers implemented as RAM-based circular buffers are more compact than those implemented as chains of registers. We also show that the proposed method requires no multiplexers to realize scalar replacement for loops with statically unknown reuse distances, which leads to an area-efficient hardware implementation. We developed a tool that implements the method and applied it to benchmark programs that require large shift registers or have statically unknown reuse distances. We found that the proposed method reduces hardware area compared to the previous method without sacrificing hardware performance. We conclude that the proposed method is an area-efficient scalar replacement method for programs that have large reuse distances or reuse distances unknown at compile time.
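The idea of replacing a register chain with a circular buffer can be sketched in C as follows. This is an illustrative sketch under assumptions, not the output of the author's tool: the kernel `out[i - d] = in[i] + in[i - d]`, the function names, and the capacity `BUF_CAP` are hypothetical. The reuse distance `d` is a run-time argument, standing in for a distance that is unknown at compile time.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_CAP 64  /* hypothetical capacity; the buffer would map to a small RAM */

/* Baseline: two array reads per iteration with reuse distance d. */
void add_delayed_baseline(const int *in, int *out, size_t n, size_t d)
{
    for (size_t i = d; i < n; i++)
        out[i - d] = in[i] + in[i - d];
}

/* Circular-buffer version: each element of in[] is read exactly once. */
void add_delayed_circular(const int *in, int *out, size_t n, size_t d)
{
    assert(d >= 1 && d <= BUF_CAP);
    int buf[BUF_CAP];
    size_t head = 0;                    /* slot holding the oldest value, in[i - d] */
    for (size_t i = 0; i < d && i < n; i++)
        buf[i] = in[i];                 /* fill phase: the first d values */
    for (size_t i = d; i < n; i++) {
        int cur = in[i];                /* single array read per iteration */
        out[i - d] = cur + buf[head];
        buf[head] = cur;                /* overwrite the oldest value in place */
        head = (head + 1 == d) ? 0 : head + 1;   /* wrap the index at d */
    }
}
```

Instead of shifting every stored value each iteration, only the index `head` moves, and a change in `d` changes the wrap point rather than the tap selected from a register chain.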
{"title":"Scalar Replacement with Circular Buffers","authors":"Kenshu Seto","doi":"10.2197/ipsjtsldm.12.13","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.12.13","url":null,"abstract":"Scalar replacement is one of effective array access optimizations that can be applied before High-level synthesis (HLS). The successful application of scalar replacement removes local memories, and as a result, it decreases hardware area. In addition, scalar replacement reduces the numbers of hardware execution cycles by reducing memory access conflicts. In scalar replacement, shift registers are introduced to remove local arrays, and reuse distances corresponds to the lengths of the shift registers. Previous scalar replacement methods implement the shift registers with chains of registers, so that the hardware area becomes large when the reuse distances are large. In addition, when reuse distances are unknown at compile time, previous scalar replacement methods require multiplexers with large numbers of inputs, which further increase on hardware area. In this paper, we propose a new technique to resolve the issues. In particular, we implement the shift registers with circular buffers instead of chains of registers. Large shift registers implemented by RAM-based circular buffers are more compact than those implemented by the chains of registers. We also show that the proposed method requires no multiplexers to realize scalar replacement for loops with statically unknown reuse distances, which leads to area-efficient hardware implementation. We developed a tool that implements the method and applied the tool to the benchmark programs which require large shift registers or have statically unknown reuse distances. We found that the hardware area is reduced with the proposed method compared to the previous method without sacrificing the hardware performance. 
We conclude that the proposed method is an area efficient scalar replacement method for programs that have large or unknown reuse distances at compile time.","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82540316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Miyazaki, Shunsuke Takai, Ittetsu Taniguchi, H. Tomiyama
This paper presents an OpenCL-based software framework that we have developed for a heterogeneous multicore architecture on the Zynq-7000 SoC. In this work, the heterogeneous architecture is designed with two hard-macro Cortex-A9 cores and two soft-macro MicroBlaze cores. A major advantage of our OpenCL framework is that it can execute OpenCL kernel programs in three ways. Experiments show the usefulness of the OpenCL framework.
{"title":"An OpenCL-based Software Framework for a Heterogeneous Multicore Architecture on Zynq-7000 SoC","authors":"T. Miyazaki, Shunsuke Takai, Ittetsu Taniguchi, H. Tomiyama","doi":"10.2197/ipsjtsldm.12.46","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.12.46","url":null,"abstract":"This paper presents an OpenCL-based software framework which we have developed for a heterogeneous multicore architecture on Zynq-7000 SoC. In this work, the heterogeneous architecture is designed with two hardmacro Cortex-A9 cores and two soft-macro MicroBlaze cores. A major advantage of our OpenCL framework is that it can execute OpenCL kernel programs in three ways. Experiments show the usefulness of the OpenCL framework.","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88776011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The end of Moore’s Law and the von Neumann bottleneck motivate researchers to seek alternative architectures that can fulfill the increasing demand for computation resources, which cannot be easily met by the traditional computing paradigm. As one important practice, neuromorphic computing systems (NCS) are proposed to mimic the biological behaviors of neurons and synapses and to accelerate the computation of neural networks. Traditional CMOS-based implementations of NCS, however, are subject to the large hardware cost required to precisely replicate the biological properties. In the recent decade, emerging nonvolatile memory (eNVM) was introduced to NCS design due to its high computing efficiency and integration density. Similar to circuits built on other nanoscale devices, eNVM-based NCS also suffer from many reliability issues. In this paper, we give a short survey of CMOS- and eNVM-based NCS, including their basic implementations and their training and inference schemes in various applications. We also discuss the design challenges of these NCS and introduce some techniques that can improve the reliability, precision, scalability, and security of the NCS. At the end, we provide our insights on the design trends and future challenges of the NCS.
{"title":"Neuromorphic Computing Systems: From CMOS To Emerging Nonvolatile Memory","authors":"Chaofei Yang, Ximing Qiao, Yiran Chen","doi":"10.2197/ipsjtsldm.12.53","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.12.53","url":null,"abstract":"The end of Moore’s Law and von Neumann bottleneck motivate researchers to seek alternative architectures that can fulfill the increasing demand for computation resources which cannot be easily achieved by traditional computing paradigm. As one important practice, neuromorphic computing systems (NCS) are proposed to mimic biological behaviors of neurons and synapses, and accelerate computation of neural networks. Traditional CMOS-based implementation of NCS, however, are subject to large hardware cost required to precisely replicate the biological properties. In very recent decade, emerging nonvolatile memory (eNVM) was introduced to NCS design due to its high computing efficiency and integration density. Similar to the circuits built on other nanoscale devices, eNVM-based NCS also suffers from many reliability issues. In this paper, we give a short survey about CMOS- and eNVM-based NCS, including their basic implementations and training and inference schemes in various applications. We also discuss the design challenges of these NCS and introduce some techniques that can improve the reliability, precision, scalability, and security of the NCS. 
At the end, we provide our insights on the design trend and future challenges of the NCS.","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84303031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a genetic algorithm for scheduling of multiple data-parallel tasks on multicores. Unlike traditional task scheduling, this work allows individual tasks to run on multiple cores in a data-parallel fashion. Experimental results show the effectiveness of the proposed algorithm over state-of-the-art
{"title":"A Genetic Algorithm for Scheduling of Data-parallel Tasks on Multicore Architectures","authors":"Yang Liu, Lin Meng, H. Tomiyama","doi":"10.2197/ipsjtsldm.12.74","DOIUrl":"https://doi.org/10.2197/ipsjtsldm.12.74","url":null,"abstract":"This paper proposes a genetic algorithm for scheduling of multiple data-parallel tasks on multicores. Unlike traditional task scheduling, this work allows individual tasks to run on multiple cores in a data-parallel fashion. Experimental results show the effectiveness of the proposed algorithm over state-of-the-art","PeriodicalId":38964,"journal":{"name":"IPSJ Transactions on System LSI Design Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83670586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}