
Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design: latest publications

A 640pW 32kHz switched-capacitor ILO analog-to-time converter for wake-up sensor applications
N. Goux, Jean-Baptiste Casanova, G. Pillonnet, F. Badets
This paper presents the architecture and ultra-low power (ULP) implementation of a switched-capacitor injection-locked oscillator (SC-ILO) used as an analog-to-time converter for wake-up sensor applications. Thanks to a novel injection-locking scheme based on switched capacitors, the SC-ILO architecture avoids power-hungry constant injection current sources. The SC-ILO design parameters and transfer function are derived from an analytical study and used to optimize the design. The ULP implementation strategy regarding power consumption, gain, modulation bandwidth, and output phase dynamic range is presented and optimized for audio wake-up sensor applications, which require ultra-low power consumption but only modest dynamic range. This paper reports experimental measurements of an SC-ILO circuit fabricated in a 22 nm FDSOI process. The measured chip exhibits a 129° phase-shift range and a 6 kHz bandwidth, leading to a 34.6 dB dynamic range for a power consumption of 640 pW at 0.4 V.
DOI: 10.1145/3370748.3406582 · Published 2020-08-10
Citations: 0
Swan: a two-step power management for distributed search engines
Liang Zhou, L. Bhuyan, Kadangode K. Ramakrishnan
The service quality of web search depends considerably on the request tail latency from Index Serving Nodes (ISNs), prompting data centers to operate them at low utilization and waste server power. ISNs can be made more energy efficient by utilizing Dynamic Voltage and Frequency Scaling (DVFS) or sleep-state techniques to exploit slack in the latency of search queries. However, state-of-the-art frameworks use a single distribution to predict a request's service time and select a high-percentile tail latency to derive the CPU's frequency or sleep states. Unfortunately, this misses many energy-saving opportunities. In this paper, we develop a simple linear regression predictor that estimates each individual search request's service time from the length of the request's posting list. To use this prediction for power management, the major challenge lies in reducing deadline misses caused by prediction errors while improving energy efficiency. We present Swan, a two-Step poWer mAnagement for distributed search eNgines. For each request, Swan selects an initial, lower frequency to optimize power, and then boosts the CPU frequency just at the right time to meet the deadline. Additionally, when a critical request arrives, we re-configure the boost instant to avoid deadline violations. Swan is implemented on the widely used Solr search engine and evaluated with two representative, large query traces. Evaluations show Swan outperforms state-of-the-art approaches, saving at least 39% CPU power on average.
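The two-step idea in the abstract can be sketched in a few lines: fit a linear model of service time versus posting-list length, run at a low frequency first, and boost just late enough to still meet the deadline. All names, constants, and the frequency pair below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of Swan's two-step scheme (illustrative constants, not the
# paper's implementation).

def fit_linear(xs, ys):
    """Ordinary least-squares fit of y ~ a*x + b in closed form."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def two_step_plan(pred_ms, deadline_ms, f_low=1.2, f_high=2.4):
    """Work is measured as runtime at f_high. Run at f_low for t1, then boost.
    Total time = t1 + (pred - t1/slow) <= deadline, with slow = f_high/f_low.
    Returns (boost_instant_ms, predicted_finish_ms)."""
    slow = f_high / f_low
    t1 = max(0.0, (deadline_ms - pred_ms) / (1 - 1 / slow))
    t1 = min(t1, pred_ms * slow)  # cannot exceed the all-low-frequency runtime
    finish = t1 + max(0.0, pred_ms - t1 / slow)
    return t1, finish
```

With a 10 ms predicted request and a 15 ms deadline, the plan spends the full 5 ms of slack at the low frequency before boosting.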
DOI: 10.1145/3370748.3406573 · Published 2020-08-10
Citations: 5
Multi-channel precision-sparsity-adapted inter-frame differential data codec for video neural network processor
Yixiong Yang, Zhe Yuan, Fang Su, Fanyang Cheng, Zhuqing Yuan, Huazhong Yang, Yongpan Liu
Activation I/O traffic is a critical bottleneck of video neural network processors. Recent works adopted an inter-frame difference method to reduce activation size. However, current methods cannot fully adapt to the varying precision and sparsity of differential data. In this paper, we propose a multi-channel precision-sparsity-adapted codec, which separates the differential activations and encodes them in multiple channels. We analyze the best-suited encoding for each channel and select the channel count with the best performance. Two-channel codec hardware has been implemented in an ASIC accelerator, which can encode/decode activations in parallel. Experiment results show that our coding achieves a 2.2x-18.2x compression rate in three scenarios with no accuracy loss, and the hardware improves speed and energy efficiency by 42x/174x compared with a software codec.
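The core idea, inter-frame differencing followed by a per-channel choice of encoding based on measured sparsity, can be sketched as below. The threshold and the run-length format are illustrative assumptions; the paper's codec adapts encodings per channel in hardware.

```python
# Illustrative sketch of sparsity-adapted differential coding (not the
# paper's exact codec).

def delta(prev, curr):
    """Inter-frame difference of two activation vectors."""
    return [c - p for p, c in zip(prev, curr)]

def rle_zeros(xs):
    """Encode as (zero_run_length, value) pairs; compact when xs is sparse.
    A trailing (run, None) pair records zeros at the end."""
    out, run = [], 0
    for x in xs:
        if x == 0:
            run += 1
        else:
            out.append((run, x))
            run = 0
    out.append((run, None))
    return out

def choose_encoding(xs):
    """Pick run-length coding when the difference is mostly zeros."""
    sparsity = xs.count(0) / len(xs)
    # heuristic threshold; a real codec would compare actual coded sizes
    return ("rle", rle_zeros(xs)) if sparsity > 0.5 else ("raw", xs)
```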
DOI: 10.1145/3370748.3407002 · Published 2020-08-10
Citations: 0
DidaSel
Khushboo Rani, Sukarn Agarwal, H. Kapoor
In a multi-core system, communication across cores is managed by an on-chip interconnect called a Network-on-Chip (NoC). The NoC introduces limitations such as high communication delay and high network power consumption, and the buffers of the NoC router consume a considerable amount of leakage power. This paper attempts to reduce leakage power consumption by using Non-Volatile Memory (NVM) technology-based buffers. NVM technology has the advantages of higher density and low leakage but suffers from costly write operations and weaker write endurance. These characteristics impact the total network power consumption, network latency, and lifetime of the router as a whole. In this paper, we propose a write reduction technique based on the dirty flits present in write-back data packets. We also propose a dirty-flit-based Virtual Channel (VC) allocation technique that distributes writes across NVM-based VCs to improve the lifetime of NVM buffers. Experimental evaluation on a full-system simulator shows that the proposed policy obtains a 53% reduction in write-back flits, which results in 27% fewer total network flits on average. All of this yields a significant decrease in total and dynamic network power consumption. The policy also shows remarkable improvement in lifetime.
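The write-reduction idea above can be sketched in a few lines: compare the incoming write-back packet against the buffered copy and write only the flits that actually changed. The function names and flat-list packet model are illustrative assumptions, not the paper's hardware.

```python
# Minimal sketch of dirty-flit write reduction for an NVM buffer
# (illustrative model, not the paper's router microarchitecture).

def dirty_mask(old_flits, new_flits):
    """Mark which flits differ between the buffered and incoming packet."""
    return [o != n for o, n in zip(old_flits, new_flits)]

def write_packet(buffer, new_flits):
    """Write only dirty flits into the NVM buffer; return writes performed.
    Fewer writes means less dynamic power and slower endurance wear-out."""
    writes = 0
    for i, dirty in enumerate(dirty_mask(buffer, new_flits)):
        if dirty:
            buffer[i] = new_flits[i]
            writes += 1
    return writes
```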
DOI: 10.1145/3370748.3406565 · Published 2020-08-10
Citations: 1
Enabling efficient ReRAM-based neural network computing via crossbar structure adaptive optimization
Chenchen Liu, Fuxun Yu, Zhuwei Qin, Xiang Chen
Resistive random-access memory (ReRAM) based accelerators have been widely studied to make neural network computing efficient in both speed and energy. Neural network optimization algorithms such as sparsity were developed for efficient computing on traditional architectures such as CPUs and GPUs. However, such efficiency improvements are hindered when these algorithms are deployed on a ReRAM-based accelerator because of its unique crossbar-structural computations, and algorithm-hardware co-optimization specific to the ReRAM-based architecture is still lacking. In this work, we propose an efficient neural network computing framework specialized for the crossbar-structural computations of ReRAM-based accelerators. The proposed framework includes crossbar-specific feature map pruning and adaptive neural network deployment. Experimental results show our design can improve computing accuracy by 9.1% compared with state-of-the-art sparse neural networks. On a well-known ReRAM-based DNN accelerator, the proposed framework demonstrates up to 1.4× speedup, 4.3× power efficiency, and 4.4× area saving.
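Why unstructured sparsity fails on crossbars can be made concrete: weights map onto fixed-size crossbar tiles, so pruning only saves work when an entire tile becomes zero and can be skipped. The tiling scheme and threshold below are illustrative assumptions, not the paper's pruning algorithm.

```python
# Hedged sketch of crossbar-aligned pruning (illustrative, not the paper's
# method): scattered zeros inside a tile save nothing, whole-tile zeros do.

def tile(matrix, size):
    """Split a 2D weight matrix into size x size crossbar tiles,
    keyed by their top-left (row, col) position."""
    rows, cols = len(matrix), len(matrix[0])
    tiles = {}
    for r in range(0, rows, size):
        for c in range(0, cols, size):
            tiles[(r, c)] = [row[c:c + size] for row in matrix[r:r + size]]
    return tiles

def prune_tiles(tiles, threshold):
    """Zero out tiles whose largest |weight| is below threshold;
    return how many crossbar tiles can now be skipped entirely."""
    skipped = 0
    for key, t in tiles.items():
        if max(abs(w) for row in t for w in row) < threshold:
            tiles[key] = [[0] * len(row) for row in t]
            skipped += 1
    return skipped
```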
DOI: 10.1145/3370748.3406581 · Published 2020-08-10
Citations: 5
An 88.6nW ozone pollutant sensing interface IC with a 159 dB dynamic range
Rishika Agarwala, Peng Wang, Akhilesh Tanneeru, Bongmook Lee, V. Misra, B. Calhoun
This paper presents a low power resistive sensor interface IC designed at 0.6 V for ozone pollutant sensing. The large resistance range of gas sensors poses challenges in designing a low power sensor interface. Existing architectures cannot achieve a high dynamic range while enabling low-VDD operation, resulting in high power consumption regardless of the adopted architecture. We present an adaptive architecture that provides baseline resistance cancellation and dynamic current control to enable low-VDD operation while maintaining a dynamic range of 159 dB across 20 kΩ-1 MΩ. The sensor interface IC is fabricated in a 65 nm bulk CMOS process and consumes 88.6 nW of power, 300x lower than the state of the art. The full system power ranges between 116 nW and 1.09 μW, including the proposed sensor interface IC, analog-to-digital converter, and peripheral circuits. The sensor interface's performance was verified using custom resistive metal-oxide sensors at ozone concentrations from 50 ppb to 900 ppb.
DOI: 10.1145/3370748.3406579 · Published 2020-08-10
Citations: 1
BLINK: bit-sparse LSTM inference kernel enabling efficient calcium trace extraction for neurofeedback devices
Zhe Chen, Garrett J. Blair, H. T. Blair, J. Cong
Miniaturized fluorescent calcium imaging microscopes are widely used for monitoring the activity of large populations of neurons in freely behaving animals in vivo. Conventional calcium image analyses extract calcium traces by iterative, bulk image processing and struggle to meet the power and latency requirements of neurofeedback devices. In this paper, we propose a calcium image processing pipeline based on a bit-sparse long short-term memory (LSTM) inference kernel (BLINK) for efficient calcium trace extraction. It largely reduces power and latency while preserving trace extraction accuracy. We implemented the customized pipeline on the Ultra96 platform; it can extract calcium traces from up to 1024 cells with sub-ms latency on a single FPGA device. We designed the BLINK circuits in a 28 nm technology. Evaluation shows that the proposed bit-sparse representation can reduce circuit area by 38.7% and power consumption by 38.4% without accuracy loss. The BLINK circuits achieve 410 pJ/inference, a 6293x and 52.4x gain in energy efficiency compared to evaluation on a high-performance CPU and GPU, respectively.
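A common form of bit-sparse weight representation, which may be what "bit-sparse" refers to here, approximates each weight as a signed sum of at most k powers of two, turning multiplications into shifts and adds. The greedy fit and exponent range below are illustrative assumptions, not the paper's exact format.

```python
# Hedged sketch of a bit-sparse weight format (assumed form, common in
# quantized-LSTM work; not confirmed as BLINK's exact scheme).

def to_bit_sparse(w, k=2, lo=-8, hi=0):
    """Greedily approximate w by at most k signed power-of-two terms with
    exponents in [lo, hi]. Returns a list of (sign, exponent) pairs."""
    terms, resid = [], w
    for _ in range(k):
        if resid == 0:
            break
        sign = 1 if resid > 0 else -1
        # pick the exponent whose power of two is closest to |resid|
        e = min(range(lo, hi + 1), key=lambda e: abs(abs(resid) - 2.0 ** e))
        terms.append((sign, e))
        resid -= sign * 2.0 ** e
    return terms

def from_bit_sparse(terms):
    """Reconstruct the approximated weight (shift-and-add in hardware)."""
    return sum(s * 2.0 ** e for s, e in terms)
```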
DOI: 10.1145/3370748.3406552 · Published 2020-08-10
Citations: 13
NS-KWS: joint optimization of near-sensor processing architecture and low-precision GRU for always-on keyword spotting
Qin Li, Sheng Lin, Changlu Liu, Yidong Liu, F. Qiao, Yanzhi Wang, Huazhong Yang
Keyword spotting (KWS) is a crucial front-end module in a speech interaction system. The always-on KWS module detects input words and activates the energy-consuming, complex backend system when keywords are detected. The performance of the KWS module therefore determines the standby performance of the whole system, and conventional KWS modules face a power-consumption bottleneck in the data conversion near the microphone sensor. In this paper, we propose an energy-efficient near-sensor processing architecture for always-on KWS that enhances continuous perception for the whole speech interaction system. By performing keyword detection in the analog domain directly after the microphone sensor, this architecture avoids an energy-consuming data converter and achieves faster speed than conventional realizations. In addition, we propose a lightweight gated recurrent unit (GRU) with negligible accuracy loss to ensure recognition performance. We implement and fabricate the proposed KWS system in a 0.18 μm CMOS process. In system-level evaluation, the hardware-software co-design architecture achieves 65.6% energy savings and a 71x speedup over the state of the art.
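For reference, a single step of the standard GRU formulation that the paper's low-precision variant builds on looks like this (scalar weights for clarity; the paper's contribution is quantizing such a cell, which this sketch does not show):

```python
import math

# Standard GRU cell step (textbook formulation, scalar version).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step for a single feature and hidden unit."""
    z = sigmoid(Wz * x + Uz * h)                 # update gate
    r = sigmoid(Wr * x + Ur * h)                 # reset gate
    h_tilde = math.tanh(Wh * x + Uh * (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde             # blended new hidden state
```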
DOI: 10.1145/3370748.3407001 · Published 2020-08-10
Citations: 3
SHEARer
Behnam Khaleghi, Sahand Salamat, Anthony Thomas, Fatemeh Asgarinejad, Yeseong Kim, Tajana Rosing
Hyperdimensional computing (HD) is an emerging paradigm for machine learning based on the evidence that the brain computes on high-dimensional, distributed representations of data. The main operation of HD is encoding, which transfers the input data to hyperspace by mapping each input feature to a hypervector, followed by a bundling procedure that adds up the hypervectors to realize the encoding hypervector. The operations of HD are simple and highly parallelizable, but their large number hampers the efficiency of HD in the embedded domain. In this paper, we propose SHEARer, an algorithm-hardware co-optimization to improve the performance and energy consumption of HD computing. We gain insight from a prudent scheme of approximating the hypervectors that, thanks to the error resiliency of HD, has minimal impact on accuracy while offering great potential for hardware optimization. Unlike previous works that generate the encoding hypervectors in full precision and then perform ex-post quantization, we compute the encoding hypervectors in an approximate manner that saves resources yet affords high accuracy. We also propose a novel FPGA architecture that achieves striking performance through massive parallelism with low power consumption. Moreover, we develop a software framework that enables training HD models by emulating the proposed approximate encodings. The FPGA implementation of SHEARer achieves an average throughput boost of 104,904× (15.7×) and energy savings of up to 56,044× (301×) compared to state-of-the-art encoding methods implemented on a Raspberry Pi 3 (GeForce GTX 1080 Ti) using practical machine learning datasets.
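The encoding-and-bundling step described above can be sketched in its full-precision form (SHEARer's contribution is approximating this on FPGA, which the sketch does not attempt). The bipolar random hypervectors and value-weighted bundling below are one common HD encoding style, assumed for illustration.

```python
import random

# Full-precision sketch of HD encoding: per-feature random bipolar
# hypervectors, bundled by a value-weighted sum (illustrative variant).

def random_hv(dim, rng):
    """Random bipolar (+1/-1) hypervector."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def encode(features, dim=64, seed=0):
    """Map each feature index to a fixed random hypervector, weight it by
    the feature value, and bundle by elementwise summation."""
    rng = random.Random(seed)  # fixed seed -> reproducible item memory
    id_hvs = [random_hv(dim, rng) for _ in features]
    return [sum(f * hv[d] for f, hv in zip(features, id_hvs))
            for d in range(dim)]
```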
{"title":"SHEARer","authors":"Behnam Khaleghi, Sahand Salamat, Anthony Thomas, Fatemeh Asgarinejad, Yeseong Kim, Tajana Rosing","doi":"10.1145/3370748.3406587","DOIUrl":"https://doi.org/10.1145/3370748.3406587","url":null,"abstract":"Hyperdimensional computing (HD) is an emerging paradigm for machine learning based on the evidence that the brain computes on high-dimensional, distributed representations of data. The main operation of HD is encoding, which transfers the input data to hyperspace by mapping each input feature to a hypervector, followed by a bundling procedure that adds up the hypervectors to realize the encoding hypervector. The operations of HD are simple and highly parallelizable, but their sheer number hampers the efficiency of HD in the embedded domain. In this paper, we propose SHEARer, an algorithm-hardware co-optimization that improves the performance and energy consumption of HD computing. We gain insight from a prudent scheme of approximating the hypervectors that, thanks to the error resiliency of HD, has minimal impact on accuracy while providing high prospects for hardware optimization. Unlike previous works that generate the encoding hypervectors in full precision and then perform ex-post quantization, we compute the encoding hypervectors in an approximate manner that saves resources yet affords high accuracy. We also propose a novel FPGA architecture that achieves striking performance through massive parallelism with low power consumption. Moreover, we develop a software framework that enables training HD models by emulating the proposed approximate encodings. 
The FPGA implementation of SHEARer achieves an average throughput boost of 104,904× (15.7×) and energy savings of up to 56,044× (301×) compared to state-of-the-art encoding methods implemented on Raspberry Pi 3 (GeForce GTX 1080 Ti) using practical machine learning datasets.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123791725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
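The encoding step the abstract describes — one base hypervector per input feature, a bundling pass that adds the hypervectors element-wise, and a quantization back to bipolar form — can be sketched in a few lines. This is a generic illustration with assumed parameters (bipolar base hypervectors, D = 1024, value-weighted bundling), not SHEARer's approximate encoder or its FPGA datapath:

```python
import random

D = 1024           # hypervector dimensionality (real HD systems often use ~10,000)
NUM_FEATURES = 8   # input features per sample
random.seed(42)

# One random bipolar (+1/-1) base hypervector per feature position.
feature_hvs = [[random.choice((-1, 1)) for _ in range(D)]
               for _ in range(NUM_FEATURES)]

def sign(x):
    # Quantize a bundled component back to {-1, 0, +1}.
    return (x > 0) - (x < 0)

def encode(sample):
    # Bundle: weight each feature's base hypervector by the feature value,
    # add the weighted hypervectors element-wise, then quantize with sign().
    bundled = [sum(value * hv[d] for value, hv in zip(sample, feature_hvs))
               for d in range(D)]
    return [sign(x) for x in bundled]

sample = [random.random() for _ in range(NUM_FEATURES)]
encoding_hv = encode(sample)
```

Keeping the encoding bipolar is what makes the downstream similarity check (typically a Hamming or cosine comparison against class hypervectors) cheap to parallelize in hardware, which is the property the paper's FPGA design exploits.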
BiasP
H. Kumar, Nikhil Chawla, S. Mukhopadhyay
Dynamic Voltage and Frequency Scaling (DVFS) plays an integral role in reducing the energy consumption of mobile devices while meeting targeted performance requirements. We examine the security obliviousness of CPUFreq, the DVFS framework in Linux-kernel based systems. Since Linux-kernel based operating systems are present in a wide array of applications, the high-level CPUFreq policies are designed to be platform-independent. Using these policies, we present the BiasP exploit, which restricts the allocation of CPU resources to a set of targeted applications, thereby degrading their performance. The exploit detects the execution of instructions on the CPU core pertinent to the targeted applications and thereafter uses CPUFreq policies to limit the CPU resources available to those instructions. We demonstrate the practicality of the exploit by operating it on a commercial smartphone running Android, a Linux-kernel based OS. We can successfully degrade the User Interface (UI) performance of the targeted applications, increasing the frame processing time and the number of dropped frames by up to 200% and 947%, respectively, for animations belonging to the targeted applications. We see a reduction of up to 66% in the number of retired instructions of the targeted applications. Furthermore, we propose a robust detector capable of detecting exploits aimed at undermining resource-allocation fairness through malicious use of the DVFS framework.
{"title":"BiasP","authors":"H. Kumar, Nikhil Chawla, S. Mukhopadhyay","doi":"10.1145/3370748.3406549","DOIUrl":"https://doi.org/10.1145/3370748.3406549","url":null,"abstract":"Dynamic Voltage and Frequency Scaling (DVFS) plays an integral role in reducing the energy consumption of mobile devices while meeting targeted performance requirements. We examine the security obliviousness of CPUFreq, the DVFS framework in Linux-kernel based systems. Since Linux-kernel based operating systems are present in a wide array of applications, the high-level CPUFreq policies are designed to be platform-independent. Using these policies, we present the BiasP exploit, which restricts the allocation of CPU resources to a set of targeted applications, thereby degrading their performance. The exploit detects the execution of instructions on the CPU core pertinent to the targeted applications and thereafter uses CPUFreq policies to limit the CPU resources available to those instructions. We demonstrate the practicality of the exploit by operating it on a commercial smartphone running Android, a Linux-kernel based OS. We can successfully degrade the User Interface (UI) performance of the targeted applications, increasing the frame processing time and the number of dropped frames by up to 200% and 947%, respectively, for animations belonging to the targeted applications. We see a reduction of up to 66% in the number of retired instructions of the targeted applications. 
Furthermore, we propose a robust detector capable of detecting exploits aimed at undermining resource-allocation fairness through malicious use of the DVFS framework.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117045131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
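The CPUFreq policies the abstract refers to are exposed to user space through the Linux kernel's standard sysfs interface, with one directory of tunable attributes per logical CPU. The sketch below shows only that interface — path construction and knob access for kernel-provided attributes such as `scaling_governor` and `scaling_max_freq` — and is an illustration of the control surface involved, not the BiasP exploit or its detector:

```python
from pathlib import Path

# Standard Linux CPUFreq sysfs layout: one cpufreq directory per logical CPU.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")

def knob_path(cpu: int, knob: str) -> Path:
    # e.g. /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    return CPUFREQ_ROOT / f"cpu{cpu}" / "cpufreq" / knob

def read_knob(cpu: int, knob: str) -> str:
    # Read a CPUFreq attribute such as 'scaling_cur_freq' (no root needed).
    return knob_path(cpu, knob).read_text().strip()

def cap_max_freq(cpu: int, khz: int) -> None:
    # Writing 'scaling_max_freq' caps the DVFS range for that core
    # (requires root). A BiasP-style policy abuse would time such caps
    # to coincide with a victim application's execution on that core.
    knob_path(cpu, "scaling_max_freq").write_text(f"{khz}\n")

print(knob_path(0, "scaling_governor"))
```

Because these knobs are platform-independent by design, a user-space process that can write them can bias frequency allocation against specific workloads — which is precisely the fairness gap the paper's detector is built to flag.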