
Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07): Latest Publications

Computing in memory with FeFETs
D. Reis, M. Niemier, X. Hu
Data transfer between a processor and memory frequently represents a bottleneck with respect to improving application-level performance. Computing in memory (CiM), where logic and arithmetic operations are performed in memory, could significantly reduce both energy consumption and computational overheads associated with data transfer. Compact, low-power, and fast CiM designs could ultimately lead to improved application-level performance. This paper introduces a CiM architecture based on ferroelectric field effect transistors (FeFETs). The CiM design can serve as a general-purpose random access memory (RAM), and can also perform Boolean operations ((N)AND, (N)OR, X(N)OR, INV) as well as addition (ADD) between words in memory. Unlike existing CiM designs based on other emerging technologies, FeFET-CiM accomplishes the aforementioned operations via a single current reference in the sense amplifier, which leads to more compact designs and lower power. Furthermore, the high Ion/Ioff ratio of FeFETs enables an inexpensive voltage-based sense scheme. Simulation-based case studies suggest that our FeFET-CiM can achieve speed-ups (and energy reductions) of ~119X (~1.6X) and ~1.97X (~1.5X) over ReRAM and STT-RAM CiM designs with respect to in-memory addition of 32-bit words. Furthermore, our approach offers an average speedup of ~2.5X and energy reduction of ~1.7X when compared to a conventional (not in-memory) approach across a wide range of benchmarks.
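The word-level behavior described above can be pictured with a short model. The sketch below is an illustrative assumption in plain Python (32-bit words, hypothetical helper names); it models only the logical results of the CiM operations, not the FeFET array or its sensing scheme:

```python
# Behavioral sketch (not the FeFET circuit) of the word-level operations
# a CiM array of this kind exposes. Word width is an assumption.

WORD = 32
MASK = (1 << WORD) - 1

def cim_ops(a: int, b: int) -> dict:
    """Model the Boolean and arithmetic results a CiM read could return
    for two 32-bit words stored in the same column."""
    return {
        "AND":  a & b,
        "NAND": ~(a & b) & MASK,
        "OR":   a | b,
        "NOR":  ~(a | b) & MASK,
        "XOR":  a ^ b,
        "XNOR": ~(a ^ b) & MASK,
        "INV":  ~a & MASK,          # unary: invert one stored word
        "ADD":  (a + b) & MASK,     # in-memory addition of the two words
    }

if __name__ == "__main__":
    res = cim_ops(0x0000FFFF, 0x00FF00FF)
    print({k: hex(v) for k, v in res.items()})
```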
Citations: 73
SPONGE
Hossein Farrokhbakht, H. M. Kamali, Natalie D. Enright Jerger, S. Hessabi
{"title":"SPONGE","authors":"Hossein Farrokhbakht, H. M. Kamali, Natalie D. Enright Jerger, S. Hessabi","doi":"10.1093/nq/s1-iii.81.390i","DOIUrl":"https://doi.org/10.1093/nq/s1-iii.81.390i","url":null,"abstract":"","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73333282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 40
A 2.6 mW Single-Ended Positive Feedback LNA for 5G Applications
S. Arshad, A. Beg, R. Ramzan
This paper presents the design of a single-ended positive feedback Common Gate (CG) Low Noise Amplifier (LNA) for 5G applications. Positive feedback is utilized to achieve the trade-off between the input matching, the gain, and the noise factor (NF) of the LNA. The positive feedback inherently cancels the noise produced by the input CG transistor. The proposed LNA is designed and fabricated in 150 nm CMOS by L-Foundry. At 1.41 GHz, the measured S11 and S22 are better than -20 dB and -8.4 dB, respectively. The highest voltage gain is 16.17 dB with a NF of 3.64 dB. The complete chip has an area of 1 mm2. The LNA's power dissipation is only 2.6 mW with a 1 dB compression point of -13 dBm. The simple, low-power, single-ended architecture of the proposed LNA allows it to be implemented in phased-array and Multiple Input Multiple Output (MIMO) radars, which have limited input and output pads and constrained power budgets for on-board components.
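As a quick sanity check on the reported figures, the decibel values convert to linear quantities as follows (standard conversions, not additional results from the paper):

```python
# Convert the reported dB figures to linear quantities.

def db_to_power_ratio(db: float) -> float:
    return 10 ** (db / 10)

def db_to_voltage_ratio(db: float) -> float:
    return 10 ** (db / 20)

print(db_to_power_ratio(3.64))     # noise factor F ~ 2.31 (from NF = 3.64 dB)
print(db_to_voltage_ratio(16.17))  # voltage gain ~ 6.43 (from 16.17 dB)
print(db_to_power_ratio(-13))      # 1 dB compression point ~ 0.05 mW (-13 dBm)
```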
Citations: 0
NNest
Liu Ke, Xin He, Xuan Zhang
Deep neural networks (DNNs) have achieved spectacular success in recent years. In response to DNNs' enormous computation demand and memory footprint, numerous inference accelerators have been proposed. However, the diverse nature of DNNs, both at the algorithm level and the parallelization level, makes it hard to arrive at a "one-size-fits-all" hardware design. In this paper, we develop NNest, an early-stage design space exploration tool that can speedily and accurately estimate the area/performance/energy of DNN inference accelerators based on high-level network topology and architecture traits, without the need for low-level RTL codes. Equipped with a generalized spatial architecture framework, NNest is able to perform fast high-dimensional design space exploration across a wide spectrum of architectural/micro-architectural parameters. Our proposed novel data movement strategies and multi-layer fitting schemes allow NNest to more effectively exploit the parallelism inherent in DNNs. Results generated by NNest demonstrate: 1) previously-undiscovered accelerator design points that can outperform the state-of-the-art implementation by 39.3% in energy efficiency; 2) Pareto frontier curves that comprehensively and quantitatively reveal the multi-objective tradeoffs in custom DNN accelerators; 3) holistic design exploration of different levels of quantization techniques, including the recently-proposed binary neural network (BNN).
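To make the exploration loop concrete, here is a minimal sketch of the kind of sweep such a tool performs: enumerate architectural parameters, score each point with an analytical model, and keep the Pareto-optimal set. The cost model below is a toy placeholder, not NNest's actual estimator:

```python
from itertools import product

def estimate(pe_rows, pe_cols, buf_kb, macs, bytes_moved):
    """Toy analytical model: returns (cycles, energy, area) for one config."""
    peak = pe_rows * pe_cols                              # MACs per cycle
    cycles = macs / peak                                  # compute-bound latency
    energy = macs * 1.0 + bytes_moved * (50.0 / buf_kb)   # cheaper moves with a bigger buffer
    area = peak * 1.0 + buf_kb * 0.5                      # PEs plus on-chip buffer
    return (cycles, energy, area)

points = [((r, c, kb), estimate(r, c, kb, macs=1e9, bytes_moved=1e8))
          for r, c, kb in product([8, 16, 32], [8, 16, 32], [64, 256, 1024])]

# Keep configurations not dominated on all three metrics by any other point.
pareto = [p for p in points
          if not any(all(qm <= pm for qm, pm in zip(q[1], p[1])) and q[1] != p[1]
                     for q in points)]
print(len(pareto), "Pareto-optimal of", len(points), "configurations")
```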
{"title":"NNest","authors":"Liu Ke, Xin He, Xuan Zhang","doi":"10.1145/3218603.3218647","DOIUrl":"https://doi.org/10.1145/3218603.3218647","url":null,"abstract":"Deep neural network (DNN) has achieved spectacular success in recent years. In response to DNN's enormous computation demand and memory footprint, numerous inference accelerators have been proposed. However, the diverse nature of DNNs, both at the algorithm level and the parallelization level, makes it hard to arrive at an \"one-size-fits-all\" hardware design. In this paper, we develop NNest, an early-stage design space exploration tool that can speedily and accurately estimate the area/performance/energy of DNN inference accelerators based on high-level network topology and architecture traits, without the need for low-level RTL codes. Equipped with a generalized spatial architecture framework, NNest is able to perform fast high-dimensional design space exploration across a wide spectrum of architectural/micro-architectural parameters. Our proposed novel date movement strategies and multi-layer fitting schemes allow NNest to more effectively exploit parallelism inherent in DNN. Results generated by NNest demonstrate: 1) previously-undiscovered accelerator design points that can outperform state-of-the-art implementation by 39.3% in energy efficiency; 2) Pareto frontier curves that comprehensively and quantitatively reveal the multi-objective tradeoffs in custom DNN accelerators; 3) holistic design exploration of different level of quantization techniques including recently-proposed binary neural network (BNN).","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88332430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Efficient Image Sensor Subsampling for DNN-Based Image Classification
Jiaxian Guo, Hongxiang Gu, M. Potkonjak
Today's mobile devices are equipped with cameras capable of taking very high-resolution pictures. For computer vision tasks which require relatively low resolution, such as image classification, subsampling is desirable to reduce the unnecessary power consumption of the image sensor. In this paper, we study the relationship between subsampling and the performance degradation of image classifiers that are based on deep neural networks (DNNs). We empirically show that subsampling with the same step size leads to very similar accuracy changes for different classifiers. In particular, we could achieve over 15x energy savings just by subsampling while suffering almost no accuracy loss. For even better energy-accuracy trade-offs, we propose AdaSkip, where the row sampling resolution is adaptively changed based on the image gradient. We implement AdaSkip on an FPGA and report its energy consumption.
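A behavioral sketch of the gradient-adaptive idea (assumed thresholds and step sizes; not the paper's FPGA implementation): rows are sampled finely where the image changes quickly and coarsely where it is flat, and only rows that were actually read feed the decision:

```python
import numpy as np

def adaptive_row_subsample(img, base_step=8, fine_step=2, thresh=10.0):
    """Return indices of sampled rows; the step shrinks where the mean
    absolute difference between the last two sampled rows is large."""
    rows = [0, fine_step]                      # bootstrap with two reads
    while True:
        grad = np.abs(img[rows[-1]].astype(float)
                      - img[rows[-2]].astype(float)).mean()
        step = fine_step if grad > thresh else base_step
        nxt = rows[-1] + step
        if nxt >= img.shape[0]:
            return rows
        rows.append(nxt)

# Flat top half, textured bottom half: expect coarse, then fine, sampling.
rng = np.random.default_rng(0)
img = np.zeros((480, 640))
img[240:] = rng.uniform(0, 255, (240, 640))
rows = adaptive_row_subsample(img)
print(f"read {len(rows)} of {img.shape[0]} rows")
```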
Citations: 6
Variation-Aware Pipelined Cores through Path Shaping and Dynamic Cycle Adjustment: Case Study on a Floating-Point Unit
Ioannis Tsiokanos, L. Mukhanov, Dimitrios S. Nikolopoulos, G. Karakonstantis
In this paper, we propose a framework for minimizing variation-induced timing failures in pipelined designs, while limiting the overhead incurred by conventional guardband-based schemes. Our approach initially limits the long latency paths (LLPs) and isolates them in as few pipeline stages as possible by shaping the path distribution. Such a strategy facilitates the adoption of a special unit that predicts the excitation of the isolated LLPs and dynamically allows an extra cycle for the completion of only these error-prone paths. Moreover, our framework performs post-layout dynamic timing analysis based on real operands that we extract from a variety of applications. This allows us to estimate the bit error rates under potential delay variations, while considering the dynamic data-dependent path excitation. When applied to the implementation of an IEEE-754 compatible double precision floating-point unit (FPU) in a 45nm process technology, the path shaping helps to reduce the bit error rates on average by 2.71x compared to the reference design under 8% delay variations. The integrated LLP prediction unit and the dynamic cycle adjustment avoid such failures and any quality loss at a cost of up to 0.61% throughput and 0.3% area overhead, while saving 37.95% power on average compared to an FPU with pessimistic margins.
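A minimal sketch of the dynamic cycle adjustment decision (the predictor rule below is a stand-in assumption, not the paper's prediction unit): operations predicted to excite an isolated LLP are granted a second cycle, while all others complete in one:

```python
# Stand-in predictor assumption: pretend the isolated LLPs correspond to
# long carry chains into the upper adder bits.

def cycles_for_op(a: int, b: int, critical_mask: int = 0xFFF00000) -> int:
    """Grant two cycles when the operands look likely to excite an LLP,
    one cycle otherwise (no guardband added to the common case)."""
    llp_predicted = bool((a | b) & critical_mask)
    return 2 if llp_predicted else 1

ops = [(3, 5), (0x7FF00000, 1), (42, 17), (0x00100000, 0x00200000)]
total = sum(cycles_for_op(a, b) for a, b in ops)
print(f"{total} cycles for {len(ops)} operations")  # 6: two ops took an extra cycle
```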
Citations: 7
Aggressive Slack Recycling via Transparent Pipelines
Gokul Subramanian Ravi, Mikko H. Lipasti
In order to operate reliably and produce expected outputs, modern architectures set timing margins conservatively at design time to support extreme variations in workload and environment. Unfortunately, the conservative guard bands set to achieve this reliability create clock cycle slack and are detrimental to performance and energy efficiency. To combat this, we propose Aggressive Slack Recycling via Transparent Pipelines. Our proposal performs timing speculation while allowing data to flow asynchronously between synchronous boundaries via transparent latches. This allows timing speculation to cater to the average slack across asynchronous operations rather than the slack of the most critical operation, maximizing slack conservation and timing speculation efficiency. We design a slack tracking mechanism which runs in parallel with the transparent data path to estimate the accumulated slack across operation sequences. The mechanism then appropriately clocks synchronous boundaries early to minimize wasted slack and maximize clock cycle savings. We implement our proposal on a spatial fabric and achieve absolute speedups of up to 20% and relative improvements (vs. competing mechanisms) of up to 75%.
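A toy model of why transparent latches help (delay values are illustrative assumptions): an opaque pipeline bills each stage a whole number of cycles, while a transparent chain only needs enough cycles to cover the summed delay, letting one stage's slack absorb another's overrun:

```python
CLOCK_PS = 1000   # assumed clock period in picoseconds

def opaque_cycles(stage_delays):
    """Each stage independently rounds its delay up to whole cycles."""
    return sum(-(-d // CLOCK_PS) for d in stage_delays)

def transparent_cycles(stage_delays):
    """Data flows through transparent latches, so only the summed delay
    must fit in a whole number of cycles (slack is recycled)."""
    return -(-sum(stage_delays) // CLOCK_PS)

delays = [620, 940, 710, 1180, 560]   # one stage overruns; others have slack
print("opaque:     ", opaque_cycles(delays), "cycles")       # 6
print("transparent:", transparent_cycles(delays), "cycles")  # 5
```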
Citations: 4
GAS
Minxuan Zhou, M. Imani, Saransh Gupta, Tajana Rosing
Yufutsu gas condensate field has a fractured reservoir which consists of conglomerate and granite. It contains condensate-rich gas. We observed some strange fluid behaviors. The production gas-oil ratio ranges from 900 m3/kl to 1,900 m3/kl depending on the production area, where reservoir depth differs. Gas production started ten years ago. At the early stage of production, the gas-oil ratio decreased with time. This trend is opposite to the normal behavior of a gas condensate reservoir. Also, we observed that heavy gas existed above light gas in the tubing at a specific well. The reservoir lies at a depth of approximately 4,000 m and is about 1,000 m thick. Pressure and temperature vary over a wide range with reservoir depth. Gas compositions at the top and the bottom of the reservoir are so different that the lightest component, methane, tends to increase with depth and temperature due to thermal diffusion, because methane has a larger thermal diffusion coefficient than the heavier components. This paper reviews the above phenomena and the theory, and reports current efforts to increase condensate recovery using a gas cycling scheme.
Citations: 4
CLINK
Zhe Chen, Andrew G. Howe, H. T. Blair, J. Cong
A neurofeedback device measures brain waves and generates a feedback signal in real time, and can be employed as a treatment for various neurological diseases. Such devices require high energy efficiency because they need to be worn by, or surgically implanted into, patients and must support a long battery lifetime. In this paper, we propose CLINK, a compact LSTM inference kernel, to achieve highly energy-efficient EEG signal processing for neurofeedback devices. The LSTM kernel can approximate conventional filtering functions while saving 84% of the computational operations. Based on this method, we propose energy-efficient customizable circuits for realizing the CLINK function. We demonstrated a 128-channel EEG processing engine on a Zynq-7030 at 0.8 W, and a scaled-up 2048-channel evaluation on a Virtex-VU9P shows that our design can achieve 215x and 7.9x higher energy efficiency compared to highly optimized implementations on an E5-2620 CPU and a K80 GPU, respectively. We carried out the CLINK design in a 15-nm technology, and synthesis results show that it can achieve 272.8 pJ/inference energy efficiency, which further outperforms our design on the Virtex-VU9P by 99x.
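For reference, the recurrence such an LSTM kernel evaluates per EEG sample is sketched below in NumPy (weight shapes and sizes are illustrative, and the hardware quantization and circuit customization are not modeled):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W, U, b stack the four gates
    [input, forget, cell, output] along their first axis."""
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_new = f * c + i * np.tanh(g)
    h_new = o * np.tanh(c_new)
    return h_new, c_new

n_in, n_hid = 8, 16                     # illustrative sizes
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4 * n_hid, n_in))
U = 0.1 * rng.standard_normal((4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h = c = np.zeros(n_hid)
for x in rng.standard_normal((32, n_in)):   # a short window of EEG samples
    h, c = lstm_step(x, h, c, W, U, b)
print(h[:4])
```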
{"title":"CLINK","authors":"Zhe Chen, Andrew G. Howe, H. T. Blair, J. Cong","doi":"10.1145/3218603.3218637","DOIUrl":"https://doi.org/10.1145/3218603.3218637","url":null,"abstract":"Neurofeedback device measures brain wave and generates feedback signal in real time and can be employed as treatments for various neurological diseases. Such devices require high energy efficiency because they need to be worn or surgically implanted into patients and support long battery life time. In this paper, we propose CLINK, a compact LSTM inference kernel, to achieve high energy efficient EEG signal processing for neurofeedback devices. The LSTM kernel can approximate conventional filtering functions while saving 84% computational operations. Based on this method, we propose energy efficient customizable circuits for realizing CLINK function. We demonstrated a 128-channel EEG processing engine on Zynq-7030 with 0.8 W, and the scaled up 2048-channel evaluation on Virtex-VU9P shows that our design can achieve 215x and 7.9x energy efficiency compared to highly optimized implementations on E5-2620 CPU and K80 GPU, respectively. We carried out the CLINK design in a 15-nm technology, and synthesis results show that it can achieve 272.8 pJ/inference energy efficiency, which further outperforms our design on the Virtex-VU9P by 99x.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74797064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Taming the beast: Programming Peta-FLOP class Deep Learning Systems
Swagath Venkataramani, V. Srinivasan, Jungwook Choi, K. Gopalakrishnan, Leland Chang
The field of Artificial Intelligence (AI) has witnessed quintessential growth in recent years with the advent of Deep Neural Networks (DNNs) that have achieved state-of-the-art performance on challenging cognitive tasks involving images, videos, text and natural language. They are being increasingly deployed in many real-world services and products, and have pervaded the spectrum of computing devices from mobile/IoT devices to server-class platforms. However, DNNs are highly compute- and data-intensive workloads, far outstripping the capabilities of today's computing platforms. For example, state-of-the-art image recognition DNNs require billions of operations to classify a single image. On the other hand, training DNN models demands exa-flops of compute and uses massive datasets requiring hundreds of gigabytes of memory. One approach to address the computational challenges imposed by DNNs is through the design of hardware accelerators, whose compute cores, memory hierarchy and interconnect topology are specialized to match the DNN's compute and communication characteristics. Several such designs, ranging from low-power IP cores to large-scale accelerator systems, have been proposed in the literature. Some factors that enable the design of specialized systems for DNNs are: (i) their computations can be expressed as static data-flow graphs, (ii) their computation patterns are regular, with no data-dependent control flows, and offer abundant opportunities for data reuse, and (iii) their functionality can be encapsulated within a set of a few (tens of) basic functions (e.g. convolution, matrix multiplication, etc.). That said, DNNs also exhibit abundant heterogeneity at various levels. Across layers, the number of input and output channels and the dimensions of each feature are substantially different. Further, each layer comprises operations whose Bytes/FLOP requirements vary by over two orders of magnitude. The heterogeneity in compute characteristics engenders a wide range of possibilities for spatiotemporally mapping DNNs on accelerator platforms, defined in terms of how computations are split across the different compute elements in the architecture and how computations assigned to a compute element are temporally sequenced in time. We are therefore led to ask whether it is possible to come up with a systematic exploration of the design space of mapping configurations to maximize DNN performance on a given accelerator architecture using a variety of different dataflows. How will the computations be partitioned and sequenced across the processing
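As a toy illustration of that mapping question (the PE count, layer sizes, and cost metric are assumptions, not the authors' tool), one can enumerate ways to split a layer's work across a PE array and score each split:

```python
from itertools import product

PES = 64            # assumed processing elements on the accelerator
K, P = 256, 1024    # assumed output channels and output pixels of one layer

def speedup(split_k, split_p):
    """Parallel speedup over one PE for a spatial split of channels/pixels."""
    if split_k * split_p > PES:
        return 0.0                                        # does not fit
    work_per_pe = -(-K // split_k) * -(-P // split_p)     # ceiling divisions
    return (K * P) / work_per_pe

splits = list(product([1, 2, 4, 8, 16, 32, 64], repeat=2))
best = max(splits, key=lambda s: speedup(*s))
print("best (channel-split, pixel-split):", best, "speedup:", speedup(*best))
```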
Citations: 0