Title: Prediction based convolution neural network acceleration: work-in-progress
Authors: Y. Yao, Zhonghai Lu
DOI: 10.1145/3125501.3125523
Venue: Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion (October 15, 2017)
Abstract: Although intra-layer parallelism is commonly used to expedite CNN execution, inter-layer parallelism is difficult to achieve because of the data dependence between layers. In this paper, we propose a two-phase prediction and correction mechanism that breaks the data dependence between CNN layers and thereby enables inter-layer parallelism. Our technique achieves an additional order of magnitude of CNN acceleration (from the order of 10x to the order of 100x) compared with three other state-of-the-art GPU-based CNN acceleration mechanisms.
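The abstract does not detail how prediction and correction interact, so the following toy sketch is an assumption about the general shape of such a scheme: a (hypothetical) predictor guesses layer 1's output so that layer 2 can start without waiting, and a correction phase re-executes layer 2 only if the prediction was wrong. The layer model (a ReLU matmul standing in for convolution) and the tolerance are likewise invented for illustration.

```python
import numpy as np

def conv_layer(x, w):
    # Stand-in for a convolutional layer; a plain matmul + ReLU keeps the sketch short.
    return np.maximum(x @ w, 0.0)

def two_phase(x, w1, w2, predictor):
    # Phase 1 (prediction): guess layer 1's output so layer 2 can start
    # speculatively (sequential here; concurrent on a GPU in the paper's setting).
    y1_pred = predictor(x)
    y2 = conv_layer(y1_pred, w2)          # speculative layer-2 result
    # Phase 2 (correction): compute the true layer-1 output and re-execute
    # layer 2 if the prediction missed beyond a tolerance.
    y1 = conv_layer(x, w1)
    if not np.allclose(y1, y1_pred, atol=1e-3):
        y2 = conv_layer(y1, w2)           # correction step
    return y2
```

Either path yields the same final result as sequential execution; the speedup would come from overlapping the speculative layer-2 work with the true layer-1 computation.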
Title: A "high resilience" mode to minimize soft error vulnerabilities in ARM cortex-R CPU pipelines: work-in-progress
Authors: X. Iturbe, Balaji Venu, John Penton, Emre Ozer
DOI: 10.1145/3125501.3125509
Abstract: This paper proposes a "high resilience" execution mode to increase the robustness of CPU pipelines against soft errors when executing critical software routines. The proposed execution mode reduces the error rate by approximately 11% in an ARM Cortex-R5 CPU and requires only a few minor modifications to its microarchitecture. These modifications do not impact the area, power consumption, or performance characteristics of the original CPU.
Title: Enabling reliable main memory using STT-MRAM via restore-aware memory management: work-in-progress
Authors: Armin Haj Aboutalebi, Lide Duan
DOI: 10.1145/3125501.3125517
Abstract: As an important non-volatile memory technology, STT-MRAM is widely considered a universal memory solution for current processors. Employing STT-MRAM as main memory offers a wide variety of benefits but also creates unique design challenges. In particular, read disturbance causes accidental data corruption in STT-MRAM after a cell is read, so data must be restored back to memory after each read operation. These extra restores significantly degrade system performance and energy efficiency, greatly changing the timing scenarios that conventional designs were optimized for. As a result, directly adopting conventional, restore-agnostic memory management techniques may lead to sub-optimal designs for STT-MRAM. In this work, we propose Restore-Aware Policy Selection (RAPS), a dynamic, hybrid row-buffer management scheme that factors in the inevitable data restores in STT-MRAM-based main memory. RAPS monitors the row-buffer hit rate at run time and dynamically switches between the open- and close-page policies. By factoring in restores, RAPS accurately captures the optimal design points, achieving optimal policy selection at run time. Our experimental results show that RAPS significantly improves system performance and energy efficiency compared with the conventional policies.
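The monitor-and-switch loop RAPS describes can be sketched in a few lines. This is a minimal model, not the paper's implementation: the monitoring window and the hit-rate threshold are invented parameters; the abstract's point is that mandatory restores shift the open-vs-close crossover, which here would surface as a higher threshold than restore-free DRAM would use.

```python
class RAPS:
    """Toy model of restore-aware row-buffer policy selection."""

    def __init__(self, window=1000, threshold=0.5):
        self.window = window        # accesses per monitoring interval (assumed)
        self.threshold = threshold  # hit rate above which open-page wins (assumed)
        self.hits = 0
        self.accesses = 0
        self.policy = "open"        # current row-buffer policy

    def record_access(self, row_hit):
        # Called once per memory access; row_hit is 1 on a row-buffer hit.
        self.hits += row_hit
        self.accesses += 1
        if self.accesses == self.window:
            self._select_policy()

    def _select_policy(self):
        hit_rate = self.hits / self.accesses
        # With a restore required after every read, keeping a row open is
        # profitable only at a higher hit rate than in conventional DRAM,
        # so the crossover threshold shifts upward.
        self.policy = "open" if hit_rate > self.threshold else "close"
        self.hits = self.accesses = 0   # start the next interval
```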
Title: Improving NVMe SSD I/O determinism with PCIe virtual channel: work-in-progress
Authors: Seonbong Kim, Joon-Sung Yang
DOI: 10.1145/3125501.3125520
Abstract: NVMe SSDs over PCIe are attractive because they provide high throughput and low latency. However, complex internal SSD operations can cause non-deterministic I/O latency, one of the most important factors in a storage system. While conventional approaches to improving I/O latency predictability are host-based, this paper proposes a novel SSD-side deterministic latency enhancement scheme. The proposed method exploits the availability of multiple PCIe virtual channels and assigns each virtual channel a different priority for data transmission. The NVMe SSD analyzes its internal latency and dynamically chooses virtual channels to compensate for it. Experimental results with a PCIe switch model show that the proposed method can reduce the latency of each transaction layer packet by 41.6%.
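The selection rule itself is not given in the abstract, so the sketch below is purely an assumed illustration of the idea: a request that has already lost much of its latency budget inside the SSD is steered to a higher-priority virtual channel. The VC count, the budget, and the mapping function are all hypothetical.

```python
def choose_vc(internal_latency_us, budget_us, num_vcs=4):
    """Pick a PCIe virtual channel for a request, assuming VC 0 has the
    highest transmission priority. The further behind its latency budget
    the request is, the higher the priority it receives."""
    overrun = min(internal_latency_us / budget_us, 1.0)
    # overrun near 1.0 -> VC 0 (highest priority); near 0.0 -> lowest VC.
    return min(num_vcs - 1, int((1.0 - overrun) * num_vcs))
```

For example, a request that has consumed 90% of its budget inside the SSD would be sent on the highest-priority channel, while one that has consumed only 10% can afford the lowest.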
Title: Multi-grained performance estimation for MPSoC compilers: work-in-progress
Authors: M. Aguilar, Abhishek Aggarwal, Awaid Shaheen, R. Leupers, G. Ascheid, J. Castrillón, L. Fitzpatrick
DOI: 10.1145/3125501.3125521
Abstract: Parallelizing compilers are a promising solution to key challenges of MPSoC programming. One fundamental prerequisite for profitable parallelization is estimating the performance of applications on the target platforms. There is a wide range of state-of-the-art performance estimation techniques, such as simulation-based and measurement-based approaches, but they typically provide estimates only at function or basic-block granularity. However, MPSoC compilers require performance information at other granularities, such as statements, loops, or even arbitrary code blocks. In this paper, we propose a framework that adapts performance information sources to any granularity required by an MPSoC compiler.
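One simple way such an adaptation can work, composing per-basic-block estimates over an execution trace into a figure for an arbitrary region, is sketched below. This is a generic illustration under assumed inputs (block names, cycle counts, trace), not the framework's actual interface.

```python
def region_estimate(block_cycles, trace, region):
    """Estimate cycles spent in an arbitrary code region (modeled as a set
    of basic blocks) from per-block estimates and an execution trace.

    block_cycles: dict mapping basic-block id -> estimated cycles per execution
    trace:        ordered list of executed basic-block ids
    region:       set of block ids forming the loop/statement/region of interest
    """
    # Every dynamic execution of a block inside the region contributes its
    # per-execution estimate, so loop iterations are counted naturally.
    return sum(block_cycles[b] for b in trace if b in region)
```

A loop body that appears twice in the trace is thus charged twice, which is what lifts a static block-level estimate to loop or region granularity.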
Title: A high-performance FPGA accelerator for sparse neural networks: work-in-progress
Authors: Yuntao Lu, Lei Gong, Chongchong Xu, Fan Sun, Yiwei Zhang, Chao Wang, Xuehai Zhou
DOI: 10.1145/3125501.3125510
Abstract: Neural networks are widely used across a large range of domains, and researchers tune the numbers of layers, neurons, and synapses to adapt them to various applications. As a consequence, neural network models are both computation- and memory-intensive, and their large memory and computing requirements make them difficult to deploy on resource-limited platforms. Sparse neural networks, which prune redundant neurons and synapses, alleviate this computation and memory pressure, but conventional accelerators cannot benefit from the sparsity. In this paper, we propose a high-performance FPGA accelerator for sparse neural networks that eliminates unnecessary computations and storage. Our design compresses the sparse weights and processes the compressed data directly. Experimental results demonstrate that the accelerator reduces the storage of convolutional and fully-connected layers by 50% and 10%, respectively, and achieves a 3x performance speedup over an optimized conventional FPGA accelerator.
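"Compress sparse weights and process the compressed data directly" is the general technique of sparse-format computation; a CSR-style software sketch of it follows. The exact on-chip layout the accelerator uses is not stated in the abstract, so CSR here is an assumption chosen for familiarity.

```python
import numpy as np

def compress(w):
    """CSR-style compression: keep only nonzero weights, their column
    indices, and per-row start offsets."""
    vals, cols, rowptr = [], [], [0]
    for row in w:
        nz = np.nonzero(row)[0]
        vals.extend(row[nz])
        cols.extend(nz)
        rowptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(rowptr)

def spmv(vals, cols, rowptr, x):
    """Matrix-vector product over the compressed form, never materializing
    the zeros; in hardware, each processing element would walk one row."""
    y = np.zeros(len(rowptr) - 1)
    for r in range(len(y)):
        s, e = rowptr[r], rowptr[r + 1]
        y[r] = vals[s:e] @ x[cols[s:e]]
    return y
```

Skipping the zeros is where both the storage reduction and the speedup come from: work is proportional to the number of nonzeros, not the dense layer size.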
Title: SSS: self-aware system-on-chip using static-dynamic hybrid method (work-in-progress)
Authors: Gaoming Du, Shibi Ma, Zhenmin Li, Zhonghai Lu, Yiming Ouyang, M. Gao
DOI: 10.1145/3125501.3125527
Abstract: The network-on-chip (NoC) has become the de facto communication standard for multi-core and many-core systems-on-chip (SoCs) due to its scalability and flexibility. However, temperature is an important factor in NoC design that affects the overall performance of an SoC: it decreases circuit frequency, increases energy consumption, and can even shorten chip lifetime. In this paper, we propose SSS, a self-aware SoC using a static-dynamic hybrid method that combines static and dynamic mapping to reduce hot-spot temperatures in NoC-based SoCs. First, we monitor the thermal distribution for self-state sensing. Then, in the static mapping stage, we compute optimal mapping solutions for different temperature modes using a discrete firefly algorithm to support self-decision making. Finally, in the dynamic mapping stage, we apply the chosen mapping at run time by configuring the NoC and an SoC sentient unit for self-optimization. Experimental results show that SSS can reduce the peak temperature by up to 30.64%, and an FPGA prototype demonstrates the effectiveness of SSS in reducing hot-spot temperatures.
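The static stage's discrete firefly search can be sketched as follows. This is a generic toy version under stated assumptions: the cost function (communication volume weighted by hop distance, standing in for the thermal objective), the swap-based move rule, and all parameters are invented; the paper's actual encoding and objective are not given in the abstract.

```python
import random

def cost(mapping, comm, dist):
    # Lower is better: traffic between tasks i, j weighted by the hop
    # distance of the tiles they are mapped to (a stand-in objective).
    n = len(mapping)
    return sum(comm[i][j] * dist[mapping[i]][mapping[j]]
               for i in range(n) for j in range(n))

def firefly_map(comm, dist, n_tasks, n_flies=8, iters=50, seed=0):
    """Discrete firefly search over task-to-tile permutations."""
    rng = random.Random(seed)
    flies = [rng.sample(range(n_tasks), n_tasks) for _ in range(n_flies)]
    for _ in range(iters):
        flies.sort(key=lambda m: cost(m, comm, dist))
        best = flies[0]                       # brightest firefly
        for m in flies[1:]:
            # Move toward the brightest: adopt one of its assignments and
            # repair the permutation with a swap.
            i = rng.randrange(n_tasks)
            j = m.index(best[i])
            m[i], m[j] = m[j], m[i]
    return min(flies, key=lambda m: cost(m, comm, dist))
```

The swap-repair step is what makes the firefly "attraction" move discrete: it nudges a candidate mapping toward the best one while always keeping it a valid permutation.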
Title: Efficient pulsed-latch implementation for multiport register files: work-in-progress
Authors: W. Elsharkasy, Hasan Erdem Yantır, A. Djahromi, A. Eltawil, F. Kurdahi
DOI: 10.1145/3125501.3125515
Abstract: This paper presents a register file design based on pulsed latches. With advantages in performance, area, and power, pulsed latches are an attractive way to implement register files. In addition, we introduce a multiport register file architecture that uses single physical read/write ports to virtualize additional read and write ports. Initial results show substantial area and power savings compared with traditional architectures.
Title: Enabling NVM-based deep learning acceleration using nonuniform data quantization: work-in-progress
Authors: Hao Yan, Ethan C. Ahn, Lide Duan
DOI: 10.1145/3125501.3125516
Abstract: Apart from employing a co-processor (e.g., a GPU) for neural network (NN) computation, utilizing the unique characteristics of nonvolatile memories (NVMs), including RRAM, phase change memory (PCM), and STT-MRAM, to accelerate NN algorithms has been extensively studied. In such approaches, input data and synaptic weights are represented by word line voltages and cell resistances, and the resulting bit line current indicates the calculation result. However, the limited number of resistance levels in an NVM cell severely reduces the data precision of the algorithm, significantly lowering model inference accuracy. Motivated by the observation that conventional, uniformly spaced quantization points are not equally important to the model, we propose a nonuniform data quantization scheme that better represents the model in NVM cells and minimizes the inference accuracy loss. Our experimental results show that the proposed scheme achieves highly accurate deep learning inference with as few as 4 bits per synaptic weight. This effectively enables NVMs with few cell resistance levels (e.g., STT-MRAM) to perform NN calculations, with additional benefits in performance, energy, and memory storage.
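The contrast between uniform and nonuniform quantization points can be made concrete with a small sketch. The paper's exact level-placement criterion is not stated in the abstract; 1-D k-means (Lloyd's algorithm), which migrates levels toward dense weight regions, is used here as a common stand-in assumption.

```python
import numpy as np

def uniform_levels(w, bits):
    # Conventional scheme: 2^bits equally spaced quantization points.
    return np.linspace(w.min(), w.max(), 2 ** bits)

def nonuniform_levels(w, bits, iters=20):
    # Lloyd's algorithm in 1-D: levels drift toward where the weights
    # actually cluster, so dense regions get finer resolution.
    levels = uniform_levels(w, bits)
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = w[idx == k].mean()
    return levels

def quantize(w, levels):
    # Snap each weight to its nearest level (one NVM resistance state).
    return levels[np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)]
```

With 4 bits (16 levels, i.e., 16 resistance states), the density-aware levels yield a lower quantization error on bell-shaped weight distributions than the uniform grid, which is the intuition behind the accuracy result.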
Title: REDEFINE®™: a case for WCET-friendly hardware accelerators for real time applications (work-in-progress)
Authors: K. Madhu, Tarun Singla, S. Nandy, R. Narayan, Francois Neumann, P. Baufreton
DOI: 10.1145/3125501.3125526
Abstract: REDEFINE is a distributed dynamic dataflow architecture designed to exploit parallelism at various granularities as an embedded system-on-chip (SoC). This paper discusses the flexibility of the REDEFINE architecture and its execution model in accelerating real-time applications, coupled with a WCET analyzer that computes execution-time bounds for real-time applications.