An Ultra Low-Power Hardware Accelerator for Acoustic Scoring in Speech Recognition
Hamid Tabani, J. Arnau, Jordi Tubella, Antonio González
2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017
DOI: 10.1109/PACT.2017.11
Citations: 19
Abstract
Accurate, real-time Automatic Speech Recognition (ASR) comes at a high energy cost, so accuracy often has to be sacrificed to fit the strict power constraints of mobile systems. However, accuracy is extremely important for the end user, and today's systems are still unsatisfactory for many applications. The most critical component of an ASR system is acoustic scoring, as it has a large impact on the accuracy of the system and takes up the bulk of execution time. The vast majority of ASR systems implement acoustic scoring by means of Gaussian Mixture Models (GMMs), where the acoustic scores are obtained by evaluating multidimensional Gaussian distributions. In this paper, we propose a hardware accelerator for GMM evaluation that reduces the energy required for acoustic scoring by three orders of magnitude compared to solutions based on CPUs and GPUs. Our accelerator implements a lazy evaluation scheme in which Gaussians are computed on demand, avoiding 50% of the computations. Furthermore, it employs a novel clustering scheme to reduce the size of the acoustic model, which yields 8x memory bandwidth savings with a negligible impact on accuracy. Finally, it includes a novel memoization scheme that avoids 74.88% of floating-point operations. The end design provides a 164x speedup and a 3532x energy reduction compared with a highly tuned implementation running on a modern mobile CPU. Compared to a state-of-the-art mobile GPU, the GMM accelerator achieves a 5.89x speedup over a highly optimized CUDA implementation while reducing energy by 241x.
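For reference, acoustic scoring with a diagonal-covariance GMM typically amounts to a log-sum-exp over per-component Gaussian log-likelihoods, and this is the baseline computation the accelerator targets with lazy evaluation and memoization. The sketch below is an illustrative software reimplementation of that baseline, not the paper's design; the `score_frame` function name, the array shapes, and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def score_frame(x, log_weights, means, inv_vars, log_consts):
    """Baseline GMM acoustic score for a single feature frame (illustrative sketch).

    x:           [D]    feature frame (e.g., MFCC vector)
    log_weights: [M]    log mixture weights log w_m
    means:       [M, D] per-component means
    inv_vars:    [M, D] per-component inverse variances (diagonal covariance)
    log_consts:  [M]    precomputed -0.5 * (D*log(2*pi) + sum(log var_m))
    Returns log p(x) = logsumexp_m( log w_m + log N(x; mu_m, Sigma_m) ).
    """
    diff = x - means                                      # [M, D]
    mahal = 0.5 * np.sum(diff * diff * inv_vars, axis=1)  # Mahalanobis term per component
    log_comp = log_weights + log_consts - mahal           # per-component log-likelihoods
    m = np.max(log_comp)                                  # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Toy usage: D=39 MFCC-like features, M=16 Gaussians per acoustic state (hypothetical sizes)
rng = np.random.default_rng(0)
D, M = 39, 16
x = rng.standard_normal(D)
means = rng.standard_normal((M, D))
variances = rng.uniform(0.5, 2.0, (M, D))
log_weights = np.log(np.full(M, 1.0 / M))
log_consts = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
print(score_frame(x, log_weights, means, 1.0 / variances, log_consts))
```

In a full recognizer this scoring is repeated for every active acoustic state on every frame, which is why skipping Gaussians (lazy evaluation), compressing the model (clustering), and reusing previously computed scores (memoization) translate directly into the large energy savings reported in the abstract.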