
Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion: Latest Publications

Prediction based convolution neural network acceleration: work-in-progress
Y. Yao, Zhonghai Lu
Although intra-layer parallelism is commonly used to expedite CNN execution, inter-layer parallelism is difficult to achieve because of the data dependence between layers. In this paper, we propose a two-phase prediction and correction mechanism that breaks the data dependence between CNN layers so as to enable inter-layer parallelism. Our technique achieves one more order of magnitude (from the order of 10 to the order of 100) of CNN acceleration compared to three other state-of-the-art GPU-based CNN acceleration mechanisms.
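The abstract does not specify the predictor, so the following is only a minimal Python sketch of the two-phase structure, assuming purely for illustration that a reduced-precision evaluation of a layer serves as the cheap prediction that lets the next layer start early:

```python
# Toy model of two-phase prediction and correction between two CNN layers.
# The predictor here (a float16 evaluation of layer 1) is an assumption for
# illustration; the paper's actual prediction mechanism may differ.
import numpy as np

def layer(x, w):
    # Stand-in for a CNN layer: matrix multiply followed by ReLU.
    return np.maximum(x @ w, 0.0)

def predict_layer(x, w):
    # Hypothetical cheap predictor: the same layer at reduced precision.
    return layer(x.astype(np.float16), w.astype(np.float16)).astype(np.float32)

rng = np.random.default_rng(0)
x  = rng.standard_normal((4, 8)).astype(np.float32)
w1 = rng.standard_normal((8, 8)).astype(np.float32)
w2 = rng.standard_normal((8, 8)).astype(np.float32)

# Phase 1 (prediction): layer 2 starts from the *predicted* layer-1 output,
# so it no longer has to wait for layer 1 to finish.
y1_pred = predict_layer(x, w1)
y2 = layer(y1_pred, w2)                    # speculative layer-2 result

# Phase 2 (correction): once the exact layer-1 output is available,
# re-execute layer 2 only for rows whose prediction was noticeably off.
y1_exact = layer(x, w1)
mispredicted = np.any(np.abs(y1_exact - y1_pred) > 1e-2, axis=1)
y2[mispredicted] = layer(y1_exact[mispredicted], w2)

err = np.abs(y2 - layer(y1_exact, w2)).max()
print(f"rows corrected: {int(mispredicted.sum())}/4, residual error: {err:.4f}")
```

The speedup reported in the paper comes from overlapping the speculative next-layer work with the exact computation on parallel hardware; the sketch shows only the functional dependence-breaking.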
Citations: 0
A "high resilience" mode to minimize soft error vulnerabilities in ARM cortex-R CPU pipelines: work-in-progress 一个“高弹性”模式,以尽量减少ARM cortex-R CPU管道中的软错误漏洞:正在进行中
X. Iturbe, Balaji Venu, John Penton, Emre Ozer
This paper proposes a "high resilience" execution mode that increases the robustness of CPU pipelines to soft errors when executing critical software routines. The proposed execution mode reduces the error rate by approximately 11% in an ARM Cortex-R5 CPU and requires only a few minor modifications to its microarchitecture. These modifications do not impact the area, power consumption, or performance characteristics of the original CPU.
Citations: 1
Enabling reliable main memory using STT-MRAM via restore-aware memory management: work-in-progress
Armin Haj Aboutalebi, Lide Duan
As an important non-volatile memory technology, STT-MRAM is widely considered a universal memory solution for current processors. Employing STT-MRAM as the main memory offers a wide variety of benefits, but also creates unique design challenges. In particular, read disturbance causes accidental data corruption in STT-MRAM after a cell is read, so data must be restored back to memory after each read operation. These extra restores significantly degrade system performance and energy efficiency, greatly changing the timing scenarios that conventional designs were optimized for. As a result, directly adopting conventional, restore-agnostic memory management techniques may lead to sub-optimal designs for STT-MRAM. In this work, we propose Restore-Aware Policy Selection (RAPS), a dynamic, hybrid row buffer management scheme that factors in the inevitable data restores in STT-MRAM-based main memory. RAPS monitors the row buffer hit rate at run time, dynamically switching between the open- and close-page policies. By factoring in restores, RAPS accurately captures the optimal design points, achieving optimal policy selection at run time. Our experimental results show that RAPS significantly improves system performance and energy efficiency compared to the conventional policies.
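To make the trade-off concrete, here is a minimal sketch of an epoch-based, RAPS-style selector. The cycle costs and epoch length are hypothetical placeholders, not the paper's cost model; the point is only how the restore penalty enters the open- vs. close-page decision made from the observed hit rate:

```python
# Epoch-based policy selector in the spirit of RAPS. All timing numbers
# below are invented for illustration.

OPEN_HIT, OPEN_MISS, CLOSE_ACCESS = 10, 45, 28   # cycles (assumed)
RESTORE = 8   # restore-after-read penalty: on the critical path when the
              # row is closed right away, but deferrable while the row
              # stays open in the buffer.

def expected_cost(policy, hit_rate):
    """Expected cycles per access under a policy, restores included."""
    if policy == "open":
        # Hits are served from the row buffer; the restore is deferred
        # until the row is eventually closed (charged to misses here).
        return hit_rate * OPEN_HIT + (1 - hit_rate) * (OPEN_MISS + RESTORE)
    # Close-page: every access reopens a row and restores immediately.
    return CLOSE_ACCESS + RESTORE

class RAPSController:
    def __init__(self, epoch=1000):
        self.epoch, self.hits, self.accesses = epoch, 0, 0
        self.policy = "open"

    def record_access(self, was_hit):
        self.hits += was_hit
        self.accesses += 1
        if self.accesses == self.epoch:
            rate = self.hits / self.accesses
            # Switch to whichever policy is cheaper at the observed rate.
            self.policy = min(("open", "close"),
                              key=lambda p: expected_cost(p, rate))
            self.hits = self.accesses = 0

ctrl = RAPSController(epoch=100)
for i in range(1000):
    ctrl.record_access(was_hit=(i % 10 != 0))    # ~90% hit-rate phase
print("policy after hit-heavy phase:", ctrl.policy)   # stays open-page
```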
Citations: 1
Improving NVMe SSD I/O determinism with PCIe virtual channel: work-in-progress
Seonbong Kim, Joon-Sung Yang
NVMe SSDs over PCIe are attractive because they provide high throughput and low latency. However, complex internal SSD operations can cause non-deterministic I/O latency, which is one of the most important factors in a storage system. While conventional approaches to improving I/O latency predictability are host-based, this paper proposes a novel SSD-side deterministic latency enhancement scheme. The proposed method exploits the fact that multiple virtual channels can be utilized, assigning each virtual channel a different priority for data transmission. The NVMe SSD analyses its internal latency and dynamically chooses among the virtual channels to compensate for it. Experimental results with a PCIe switch model show that the proposed method can reduce the latency of each transaction layer packet by 41.6%.
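The paper's exact selection logic is not given, but the idea of steering packets across differently prioritized virtual channels to compensate for internal latency can be sketched behaviorally. All latency figures below are invented for illustration:

```python
# Behavioral sketch only; PCIe virtual-channel arbitration and NVMe
# internals are abstracted away, and every number here is an assumption.

TARGET_NS = 100_000                    # hypothetical end-to-end latency target
VC_LINK_NS = {0: 9_000, 1: 4_000}      # per-TLP link latency: VC1 prioritized

def choose_vc(internal_ns):
    """Pick the virtual channel whose link latency compensates the
    internal latency this command has already accrued."""
    slack = TARGET_NS - internal_ns
    # Escalate to the prioritized channel once the remaining budget is
    # smaller than the slow channel's latency.
    return 1 if slack < VC_LINK_NS[0] else 0

for internal in (80_000, 93_000, 97_000):
    vc = choose_vc(internal)
    print(f"internal={internal} ns -> VC{vc}, "
          f"total={internal + VC_LINK_NS[vc]} ns")
```

Commands that finished their internal work quickly ride the slow channel, freeing the prioritized one for late commands, which keeps total latency clustered near the target.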
Citations: 0
Multi-grained performance estimation for MPSoC compilers: work-in-progress
M. Aguilar, Abhishek Aggarwal, Awaid Shaheen, R. Leupers, G. Ascheid, J. Castrillón, L. Fitzpatrick
Parallelizing compilers are a promising solution to tackle key challenges of MPSoC programming. One fundamental requirement for profitable parallelization is estimating the performance of applications on the target platforms. A wide range of state-of-the-art performance estimation techniques exists, including simulation-based and measurement-based approaches, but they typically provide estimates only at function or basic-block granularity. However, MPSoC compilers require performance information at other granularities, such as statements, loops, or even arbitrary code blocks. In this paper, we propose a framework that adapts performance information sources to any granularity required by an MPSoC compiler.
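A minimal sketch of what such granularity adaptation can look like, assuming (hypothetically) that the underlying estimator yields per-basic-block cycle counts and execution counts; a coarser region's estimate is then the count-weighted sum over the blocks it spans:

```python
# Sketch with hypothetical data structures: per-block estimates are
# aggregated to whatever region granularity the MPSoC compiler asks for.

# block -> (estimated cycles per execution, execution count), e.g. from a
# simulation- or measurement-based estimator plus profiling.
block_info = {
    "bb0": (12, 1),      # function prologue
    "bb1": (30, 100),    # loop body
    "bb2": (8, 100),     # loop latch
    "bb3": (15, 1),      # epilogue
}

# Regions of interest, expressed as the sets of blocks they span.
regions = {
    "loop":     {"bb1", "bb2"},
    "function": {"bb0", "bb1", "bb2", "bb3"},
}

def region_cycles(blocks):
    # Count-weighted aggregation of the fine-grained estimates.
    return sum(cyc * cnt for cyc, cnt in (block_info[b] for b in blocks))

for name, blocks in regions.items():
    print(f"{name}: {region_cycles(blocks)} cycles")
```

Statement-level estimates would work the same way, with a statement mapped to the (possibly partial) blocks it occupies.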
Citations: 3
A high-performance FPGA accelerator for sparse neural networks: work-in-progress
Yuntao Lu, Lei Gong, Chongchong Xu, Fan Sun, Yiwei Zhang, Chao Wang, Xuehai Zhou
Neural networks are widely used across a large range of domains, and researchers tune the numbers of layers, neurons, and synapses to adapt them to various applications. As a consequence, neural network models are both computation- and memory-intensive. Because of these large memory and computing requirements, it is difficult to deploy neural networks on resource-limited platforms. Sparse neural networks, which prune redundant neurons and synapses, alleviate the computation and memory pressure. However, conventional accelerators cannot benefit from this sparsity. In this paper, we propose a high-performance FPGA accelerator for sparse neural networks that eliminates redundant computations and storage. Our design compresses the sparse weights and processes the compressed data directly. Experimental results demonstrate that our accelerator reduces the storage of convolutional and fully connected layers by 50% and 10%, respectively, and achieves a 3x performance speedup over an optimized conventional FPGA accelerator.
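The accelerator's datapath is not reproduced here, but "compressing sparse weights and processing the compressed data directly" can be illustrated in software with a CSR-encoded layer; the matrix size and sparsity below are arbitrary:

```python
# Minimal CSR illustration: store only nonzero weights and perform the
# layer's matrix-vector product without ever decompressing.
import numpy as np

def to_csr(w):
    """Compress a dense weight matrix into CSR (values, cols, row_ptr)."""
    values, cols, row_ptr = [], [], [0]
    for row in w:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        cols.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(cols), np.array(row_ptr)

def csr_matvec(values, cols, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        s, e = row_ptr[r], row_ptr[r + 1]
        y[r] = values[s:e] @ x[cols[s:e]]   # MACs over nonzeros only
    return y

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64)) * (rng.random((64, 64)) < 0.1)  # ~90% pruned
x = rng.standard_normal(64)
v, c, p = to_csr(w)
assert np.allclose(csr_matvec(v, c, p, x), w @ x)
print(f"stored weights: {len(v)} of {w.size}")
```

A hardware design additionally pipelines and parallelizes these per-row dot products; the functional idea is the same.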
Citations: 6
SSS: self-aware system-on-chip using static-dynamic hybrid method (work-in-progress)
Gaoming Du, Shibi Ma, Zhenmin Li, Zhonghai Lu, Yiming Ouyang, M. Gao
Network-on-chip (NoC) has become the de facto communication standard for multi-core and many-core systems-on-chip, owing to its scalability and flexibility. However, temperature is an important factor in NoC design that affects the overall performance of the SoC: it decreases circuit frequency, increases energy consumption, and can even shorten chip lifetime. In this paper, we propose SSS, a self-aware SoC using a static-dynamic hybrid method that combines dynamic mapping with static mapping to reduce hot-spot temperatures in NoC-based SoCs. First, we monitor the thermal distribution for self-state sensing. Then, in the static mapping stage, we compute the optimal mapping solutions under different temperature modes using a discrete firefly algorithm to support self-decision making. Finally, in the dynamic mapping stage, we achieve dynamic mapping by configuring the NoC and an SoC sentient unit for self-optimization. Experimental results show that SSS can reduce the peak temperature by up to 30.64%. An FPGA prototype demonstrates the effectiveness of SSS in reducing hot-spot temperatures.
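As a rough illustration of the static stage, the sketch below runs a simplified discrete firefly search over task-to-tile mappings on a 3x3 mesh, scoring candidates with a crude hotspot proxy. The proxy, power numbers, and move rule are all assumptions for illustration, not the paper's thermal model:

```python
# Simplified discrete firefly search: fireflies are candidate mappings,
# brightness is (inverse) hotspot temperature, and "attraction" copies a
# few assignments from the brightest mapping via swaps.
import random

TASK_POWER = [4, 3, 5, 2, 6, 1, 3, 2, 4]   # one task per tile (assumed)
N = 3                                       # 3x3 mesh

def hotspot(mapping):
    power = [TASK_POWER[t] for t in mapping]       # power at each tile
    def temp(i):
        r, c = divmod(i, N)
        nbrs = [power[x * N + y]
                for x, y in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                if 0 <= x < N and 0 <= y < N]
        return power[i] + 0.3 * sum(nbrs)          # crude thermal coupling
    return max(temp(i) for i in range(N * N))

def move_toward(m, bright):
    """Discrete attraction: adopt a few of the brighter mapping's
    assignments by swapping, keeping the mapping a permutation."""
    m = m[:]
    for pos in random.sample(range(len(m)), 3):
        if m[pos] != bright[pos]:
            j = m.index(bright[pos])
            m[pos], m[j] = m[j], m[pos]
    return m

random.seed(7)
fireflies = [random.sample(range(9), 9) for _ in range(10)]
for _ in range(50):
    fireflies.sort(key=hotspot)                    # brightest first
    best = fireflies[0]
    fireflies = [best] + [move_toward(f, best) for f in fireflies[1:]]
print("best mapping:", fireflies[0], "hotspot proxy:", hotspot(fireflies[0]))
```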
Citations: 1
Efficient pulsed-latch implementation for multiport register files: work-in-progress
W. Elsharkasy, Hasan Erdem Yantır, A. Djahromi, A. Eltawil, F. Kurdahi
This paper presents a register file design using pulsed latches. With advantages in performance, area, and power, pulsed latches are an attractive implementation for register files. In addition, we propose a multiport register file architecture that uses single physical read/write ports to virtualize additional read and write ports. Initial results show large savings in area and power compared to traditional architectures.
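One plausible reading of port virtualization is double-pumping: a single physical port serves two logical ports by operating on both clock phases. The behavioral model below is an illustration under that assumption, not the paper's circuit:

```python
# Illustrative software model only: pulse generation, timing, and hazards
# are ignored; one physical read port and one physical write port each
# operate once per clock phase, serving two logical ports per cycle.

class PulsedLatchRegFile:
    def __init__(self, nregs=32):
        self.regs = [0] * nregs

    def cycle(self, reads=(), writes=()):
        """reads: up to 2 register indices; writes: up to 2 (index, value)
        pairs. Each phase uses each physical port at most once."""
        assert len(reads) <= 2 and len(writes) <= 2
        results = []
        for phase in (0, 1):
            if phase < len(reads):           # physical read port, this phase
                results.append(self.regs[reads[phase]])
            if phase < len(writes):          # physical write port, this phase
                idx, val = writes[phase]
                self.regs[idx] = val
        return results

rf = PulsedLatchRegFile()
rf.cycle(writes=[(1, 42), (2, 7)])           # two writes, one physical port
print(rf.cycle(reads=[1, 2]))                # [42, 7]: two reads, one port
```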
Citations: 2
Enabling NVM-based deep learning acceleration using nonuniform data quantization: work-in-progress
Hao Yan, Ethan C. Ahn, Lide Duan
Apart from employing a co-processor (e.g., a GPU) for neural network (NN) computation, utilizing the unique characteristics of nonvolatile memories (NVM), including RRAM, phase change memory (PCM), and STT-MRAM, to accelerate NN algorithms has been extensively studied. In such approaches, input data and synaptic weights are represented using word line voltages and cell resistances, with the resulting bit line current indicating the calculation result. However, the limited number of resistance levels in an NVM cell largely reduces the algorithm's data precision, significantly lowering the model's inference accuracy. Motivated by the observation that the conventional, uniformly generated data quantization points are not all equally important to the model, we propose a nonuniform data quantization scheme to better represent the model in NVM cells and minimize the inference accuracy loss. Our experimental results show that the proposed scheme achieves highly accurate deep learning model inference using as few as 4 bits for synaptic weight representation. This effectively enables an NVM with few cell resistance levels (e.g., STT-MRAM) to perform NN calculation, and also brings additional benefits in performance, energy, and memory storage.
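The paper's exact point-selection scheme is not reproduced here, but the core idea can be sketched: with 4 bits there are only 16 representable levels, so placing them where the weight distribution is dense beats uniform spacing. The sketch below uses a few Lloyd (k-means) iterations as one plausible density-aware placement:

```python
# Uniform vs. density-aware 4-bit quantization of bell-shaped weights.
# The Lloyd-iteration placement is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.05, 100_000)           # synthetic synaptic weights

def quantize(w, points):
    # Map each weight to its nearest quantization point.
    return points[np.argmin(np.abs(w[:, None] - points[None, :]), axis=1)]

# 16 uniform points over the weight range.
uniform = np.linspace(weights.min(), weights.max(), 16)

# 16 nonuniform points: percentile init, then Lloyd iterations.
pts = np.percentile(weights, np.linspace(1, 99, 16))
for _ in range(10):
    q = quantize(weights, pts)
    pts = np.array([weights[q == p].mean() if (q == p).any() else p
                    for p in pts])

for name, p in (("uniform", uniform), ("nonuniform", pts)):
    mse = np.mean((weights - quantize(weights, p)) ** 2)
    print(f"{name:10s} MSE = {mse:.2e}")
```

On this distribution the nonuniform points yield a visibly lower quantization error than the uniform ones, which is the effect the proposed scheme exploits to preserve inference accuracy.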
Citations: 2
REDEFINE®™: a case for WCET-friendly hardware accelerators for real time applications (work-in-progress)
K. Madhu, Tarun Singla, S. Nandy, R. Narayan, Francois Neumann, P. Baufreton
REDEFINE is a distributed dynamic dataflow architecture designed to exploit parallelism at various granularities as an embedded system-on-chip (SoC). This paper dwells on the flexibility of the REDEFINE architecture and its execution model in accelerating real-time applications, coupled with a WCET analyzer that computes execution-time bounds for those applications.
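As a generic illustration of WCET analysis over a dataflow program (not the REDEFINE toolchain itself): once each actor's WCET is known, a bound for a dataflow DAG follows from the longest weighted path, computable in a few lines:

```python
# Longest-path WCET bound over a toy dataflow DAG; actor WCETs and the
# graph are invented for illustration.
import functools

wcet = {"src": 5, "a": 20, "b": 35, "c": 10, "sink": 5}   # cycles (assumed)
succs = {"src": ["a", "b"], "a": ["c"], "b": ["sink"],
         "c": ["sink"], "sink": []}

@functools.lru_cache(maxsize=None)
def bound(node):
    """WCET bound from `node` to completion: own WCET plus the worst
    successor bound (actors on independent paths run in parallel)."""
    return wcet[node] + max((bound(s) for s in succs[node]), default=0)

print("pipeline WCET bound:", bound("src"), "cycles")   # src->b->sink = 45
```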
Citations: 0