Most GPU-based graph systems cannot handle large-scale graphs that do not fit in the GPU memory. The ever-increasing graph size demands a scale-up graph system, which can run on a single GPU with optimized memory access efficiency and well-controlled data transfer overhead. However, existing systems either incur redundant data transfers or fail to use shared memory. In this paper we present Graphie, a system to efficiently traverse large-scale graphs on a single GPU. Graphie stores the vertex attribute data in the GPU memory and streams edge data asynchronously to the GPU for processing. Graphie's high performance relies on two renaming algorithms. The first algorithm renames the vertices so that the source vertices can be easily loaded into shared memory to reduce global memory accesses. The second algorithm inserts virtual vertices into the vertex set to rename real vertices, which enables the use of a small boolean array to track active partitions. The boolean array also resides in shared memory and can be updated in constant time. The renaming algorithms do not introduce any extra overhead in the GPU memory or graph storage on disk. Graphie's runtime overlaps data transfer with kernel execution and reuses transferred data in the GPU memory. The evaluation of Graphie on 7 real-world graphs with up to 1.8 billion edges demonstrates substantial speedups over X-Stream, a state-of-the-art edge-centric graph processing framework on the CPU, and GraphReduce, an out-of-memory graph processing system for GPUs.
"Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU," by Wei Han, Daniel Mawhirter, Bo Wu, and Matthew Buland. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.41
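The renaming idea described in the Graphie abstract above can be pictured with a small host-side sketch. This is a minimal, hypothetical illustration and not Graphie's actual algorithm: it assumes the edge list has already been split so that each vertex appears as a source in at most one partition, renumbers the sources of each partition into one contiguous ID range (so a thread block can stage exactly that attribute slice in shared memory), and keeps a small per-partition boolean "active" array. All identifiers (Edge, RenamedGraph, rename_sources, mark_active) are invented for the sketch.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Edge { uint32_t src, dst; };

struct RenamedGraph {
    std::vector<std::vector<Edge>> partitions; // edges with renamed source IDs
    std::vector<uint32_t> first_id;            // first renamed source ID of each partition
    std::vector<uint32_t> old_id;              // renamed ID -> original ID
};

// Sources of partition p receive the consecutive IDs starting at first_id[p],
// so a thread block working on p can copy exactly that slice of the vertex
// attribute array from global to shared memory.
RenamedGraph rename_sources(const std::vector<std::vector<Edge>>& parts) {
    RenamedGraph g;
    g.partitions.resize(parts.size());
    uint32_t next = 0;
    for (size_t p = 0; p < parts.size(); ++p) {
        g.first_id.push_back(next);
        std::unordered_map<uint32_t, uint32_t> local;  // original src -> renamed ID
        for (const Edge& e : parts[p]) {
            auto it = local.find(e.src);
            if (it == local.end()) {
                it = local.emplace(e.src, next++).first;
                g.old_id.push_back(e.src);
            }
            g.partitions[p].push_back({it->second, e.dst});
        }
    }
    return g;
}

// Per-partition activity flags, analogous to the small boolean array the
// abstract keeps in shared memory: only partitions marked active need to be
// streamed to the GPU in the next traversal iteration.
inline void mark_active(std::vector<uint8_t>& active, uint32_t partition) {
    active[partition] = 1;   // constant-time update
}
```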
P. Rawat, Aravind Sukumaran-Rajam, A. Rountev, F. Rastello, L. Pouchet, P. Sadayappan
Compute-intensive GPU architectures allow the use of high-order 3D stencils for better computational accuracy. These stencils are usually compute-bound. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. We develop an optimization framework that models stencils as a forest of trees and performs statement reordering to reduce register use. The effectiveness of the approach is demonstrated through experimental results on several high-order stencils.
"POSTER: Statement Reordering to Alleviate Register Pressure for Stencils on GPUs." In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.40
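To see why statement order matters for register pressure, consider the toy model below: each straight-line statement defines one value and uses earlier values, a value is live from its definition to its last use, and the peak number of simultaneously live values approximates register demand. The Stmt layout, max_pressure, and the five-statement example are invented for illustration; the paper's framework builds a forest-of-trees model of the stencil rather than this flat liveness scan.

```cpp
#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

struct Stmt { int def; std::vector<int> uses; };

// Peak number of simultaneously live values for a given statement order.
int max_pressure(const std::vector<Stmt>& order) {
    std::vector<int> last_use(order.size(), -1);       // last position each value is used
    for (int i = 0; i < (int)order.size(); ++i)
        for (int v : order[i].uses) last_use[v] = i;

    std::set<int> live;
    int peak = 0;
    for (int i = 0; i < (int)order.size(); ++i) {
        live.insert(order[i].def);
        peak = std::max(peak, (int)live.size());
        for (int v : order[i].uses)
            if (last_use[v] == i) live.erase(v);        // value dies here
    }
    return peak;
}

int main() {
    // a = ...; b = ...; c = ...; d = a+b; e = c+d   (values numbered 0..4)
    std::vector<Stmt> interleaved = {{0, {}}, {1, {}}, {2, {}}, {3, {0, 1}}, {4, {2, 3}}};
    // Same computation, but each loaded value is consumed as soon as possible.
    std::vector<Stmt> reordered   = {{0, {}}, {1, {}}, {3, {0, 1}}, {2, {}}, {4, {2, 3}}};
    std::printf("pressure before reordering: %d\n", max_pressure(interleaved)); // prints 4
    std::printf("pressure after  reordering: %d\n", max_pressure(reordered));   // prints 3
}
```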
Vicent Selfa, J. Sahuquillo, L. Eeckhout, S. Petit, M. E. Gómez
Achieving system fairness is a major design concern in current multicore processors. Unfairness arises due to contention in the shared resources of the system, such as the LLC and main memory. To address this problem, many research works have proposed novel cache partitioning policies aimed at addressing system fairness without harming performance. Unfortunately, existing proposals targeting fairness require extra hardware, which makes them impractical in commercial processors. Recent Intel Xeon processors feature Cache Allocation Technology (CAT), a hardware cache partitioning mechanism that can be controlled from userspace software and that allows partitions to be created in the LLC and different groups of applications to be assigned to them. In this paper we propose a family of clustering-based cache partitioning policies to address fairness in systems that feature Intel’s CAT. The proposal acts at two levels: applications showing a similar amount of core stalls due to LLC accesses are first grouped into clusters, after which each cluster is given a number of ways using a simple mathematical model. To the best of our knowledge, this is the first attempt to address system fairness using the cache partitioning hardware in a real product. Results show that our best performing policy reduces system unfairness by up to 80% (39% on average) for 8-application workloads and by up to 45% (25% on average) for 12-application workloads compared to a non-partitioning approach.
"Application Clustering Policies to Address System Fairness with Intel’s Cache Allocation Technology." In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.19
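A rough sketch of the two-level idea described above, with made-up thresholds and a simplified allocation rule: applications are grouped by their measured LLC-related stall counts, and each group is then given a contiguous block of LLC ways. On a real system the resulting way masks would be applied through Intel CAT (for example via the Linux resctrl filesystem or the pqos tool); the clustering heuristic and proportional model here are stand-ins, not the paper's exact policy.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct App { int id; double llc_stalls; };   // core stalls due to LLC accesses, measured online

// Group applications with stall counts of similar magnitude: walk them in
// increasing stall order and start a new cluster whenever the next value is
// more than 'gap' times the previous one.
std::vector<std::vector<App>> cluster_by_stalls(std::vector<App> apps, double gap = 2.0) {
    std::sort(apps.begin(), apps.end(),
              [](const App& a, const App& b) { return a.llc_stalls < b.llc_stalls; });
    std::vector<std::vector<App>> clusters;
    for (const App& a : apps) {
        if (clusters.empty() || a.llc_stalls > gap * clusters.back().back().llc_stalls)
            clusters.push_back({});
        clusters.back().push_back(a);
    }
    return clusters;
}

// Give each cluster a contiguous block of LLC ways, roughly proportional to its
// share of the total stalls (CAT capacity bitmasks must be contiguous). Each
// returned mask would then be written to that cluster's CLOS, e.g. through
// /sys/fs/resctrl/<group>/schemata on Linux (not shown here).
std::vector<uint32_t> assign_way_masks(const std::vector<std::vector<App>>& clusters,
                                       int total_ways) {
    double total = 0;
    for (const auto& c : clusters)
        for (const auto& a : c) total += a.llc_stalls;
    std::vector<uint32_t> masks;
    int start = 0;
    for (size_t i = 0; i < clusters.size(); ++i) {
        double share = 0;
        for (const auto& a : clusters[i]) share += a.llc_stalls;
        int ways = std::max(1, (int)(total_ways * share / total));
        if (i + 1 == clusters.size())                        // last cluster takes the leftovers
            ways = std::max(1, total_ways - start);
        masks.push_back(((1u << ways) - 1u) << start);
        start += ways;
    }
    return masks;
}
```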
Swagath Venkataramani, Jungwook Choi, V. Srinivasan, K. Gopalakrishnan, Leland Chang
The growing prominence and computational challenges imposed by Deep Neural Networks (DNNs) have fueled the design of specialized accelerator architectures and associated dataflows to improve their implementation efficiency. Each of these solutions serves as a datapoint on the throughput vs. energy trade-off for a given DNN and a set of architectural constraints. In this paper, we set out to explore whether it is possible to systematically explore the design space so as to estimate a given DNN's (both inference and training) performance on a shared-memory architecture specification using a variety of dataflows. To this end, we have developed a framework, DEEPMATRIX, which, given a description of a DNN and a hardware architecture, automatically identifies how the computations of the DNN's layers need to be partitioned and mapped onto the architecture such that the overall performance is maximized, while meeting the constraints imposed by the hardware (processing power, memory capacity, bandwidth, etc.). We demonstrate DEEPMATRIX's effectiveness for the VGG DNN benchmark, showing the trade-offs and sensitivity of utilization based on different architecture constraints.
"POSTER: Design Space Exploration for Performance Optimization of Deep Neural Networks on Shared Memory Accelerators." In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.39
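The flavor of such a design-space exploration can be conveyed with a toy loop: enumerate ways to split one GEMM-like layer across a grid of processing elements, estimate runtime with a crude roofline-style model, and keep the best mapping whose per-PE working set fits in on-chip memory. The cost model, the hardware parameters, and every identifier below are invented; DEEPMATRIX's actual models and search space are not public in this abstract.

```cpp
#include <algorithm>
#include <cstdio>

// Invented roofline-style cost model for mapping an M x N x K GEMM-like layer
// onto a pm x pn grid of processing elements.
struct HW { int pes; double macs_per_cycle_per_pe; double bytes_per_cycle; double sram_bytes; };
struct Mapping { int pm, pn; double cycles; };

Mapping explore(int M, int N, int K, const HW& hw) {
    Mapping best{1, 1, 1e30};
    for (int pm = 1; pm <= hw.pes; ++pm) {
        if (hw.pes % pm) continue;
        int pn = hw.pes / pm;                          // pm x pn grid of PEs
        double tile_m = (double)M / pm, tile_n = (double)N / pn;
        // Per-PE working set: an input tile, a weight tile, an output tile (fp16).
        double bytes = 2.0 * (tile_m * K + K * tile_n + tile_m * tile_n);
        if (bytes > hw.sram_bytes / hw.pes) continue;  // working set does not fit on chip
        double compute = (double)M * N * K / (hw.pes * hw.macs_per_cycle_per_pe);
        double traffic = 2.0 * ((double)M * K * pn + (double)K * N * pm); // crude: each operand
                                                       // re-fetched once per grid dimension
        double cycles = std::max(compute, traffic / hw.bytes_per_cycle);
        if (cycles < best.cycles) best = {pm, pn, cycles};
    }
    return best;
}

int main() {
    HW hw{16, 2.0, 32.0, 4.5 * 1024 * 1024};           // 16 PEs, 4.5 MB SRAM (made-up numbers)
    Mapping m = explore(1024, 1024, 128, hw);
    std::printf("best grid: %d x %d, est. %.0f cycles\n", m.pm, m.pn, m.cycles);
}
```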
Andreas Sembrant, Trevor E. Carlson, Erik Hagersten, D. Black-Schaffer
Modern SoCs contain several CPU cores and many GPU cores to execute both general purpose and highly-parallel graphics workloads. In many SoCs, more area is dedicated to graphics than to general purpose compute. Despite this, the micro-architecture research community primarily focuses on GPGPU and CPU-only research, and not on graphics (the primary workload for many SoCs). The main reason for this is the lack of efficient tools and simulators for modern graphics applications. This work focuses on the GPU's memory traffic generated by graphics. We describe a new graphics tracing framework and use it both to study graphics applications' memory behavior and to examine how CPUs and GPUs affect system performance. Our results show that graphics applications exhibit a wide range of memory behavior, both across applications and over time, and slow down co-running SPEC applications by 59% on average.
"POSTER: Putting the G back into GPU/CPU Systems Research." In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.60
In this study, we demonstrate that performance may be undermined in state-of-the-art intra-SM sharing schemes for concurrent kernel execution (CKE) on GPUs due to interference among concurrent kernels. We highlight that cache partitioning techniques proposed for CPUs are not effective for GPUs. We therefore propose to balance memory accesses and limit the number of in-flight memory instructions issued from concurrent kernels to reduce memory pipeline stalls. Our proposed schemes significantly improve the performance of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK.
"POSTER: Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls," by Hongwen Dai, Zhen Lin, C. Li, Chen Zhao, Fei Wang, Nanning Zheng, and Huiyang Zhou. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.30
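The core mechanism argued for in the abstract above, capping the number of in-flight memory instructions each concurrent kernel may have outstanding, can be pictured as a per-kernel token counter consulted at issue time. This is a simplified software analogy of a hardware scheduler policy, with an invented structure and semantics, not the authors' design.

```cpp
#include <cstdint>

// Simplified analogy of issue-time throttling: each co-running kernel gets a
// cap on outstanding memory instructions, so one memory-intensive kernel
// cannot monopolize the memory pipeline (MSHRs, LSU queues).
struct KernelThrottle {
    uint32_t inflight = 0;   // memory instructions issued but not yet completed
    uint32_t cap;            // per-kernel limit, e.g. tuned from observed stalls
};

// Called when the scheduler considers a warp's memory instruction.
inline bool may_issue_mem(KernelThrottle& k) {
    if (k.inflight >= k.cap) return false;   // hold this warp, try a warp from another kernel
    ++k.inflight;
    return true;
}

// Called when the memory request completes and frees its slot.
inline void mem_completed(KernelThrottle& k) { --k.inflight; }
```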
Dong Dai, Yong Chen, P. Carns, John Jenkins, R. Ross
Provenance describes detailed information about the history of a piece of data, containing the relationships among elements such as users, processes, jobs, and workflows that contribute to the existence of data. Provenance is key to supporting many data management functionalities that are increasingly important in operations such as identifying data sources, parameters, or assumptions behind a given result; auditing data usage; or understanding details about how inputs are transformed into outputs. Despite its importance, however, provenance support is largely underdeveloped in highly parallel architectures and systems. One major challenge is the demanding requirements of providing provenance service in situ. The need to remain lightweight and to be always on often conflicts with the need to be transparent and offer an accurate catalog of details regarding the applications and systems. To tackle this challenge, we introduce a lightweight provenance service, called LPS, for high-performance computing (HPC) systems. LPS leverages a kernel instrumentation mechanism to achieve transparency and introduces representative execution and flexible granularity to capture comprehensive provenance with controllable overhead. Extensive evaluations and use cases have confirmed its efficiency and usability. We believe that LPS can be integrated into current and future HPC systems to support a variety of data management needs.
"Lightweight Provenance Service for High-Performance Computing." In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.14
Hamid Tabani, J. Arnau, Jordi Tubella, Antonio González
Accurate, real-time Automatic Speech Recognition (ASR) comes at a high energy cost, so accuracy often has to be sacrificed in order to fit the strict power constraints of mobile systems. However, accuracy is extremely important for the end-user, and today's systems are still unsatisfactory for many applications. The most critical component of an ASR system is the acoustic scoring, as it has a large impact on the accuracy of the system and takes up the bulk of execution time. The vast majority of ASR systems implement the acoustic scoring by means of Gaussian Mixture Models (GMMs), where the acoustic scores are obtained by evaluating multidimensional Gaussian distributions. In this paper, we propose a hardware accelerator for GMM evaluation that reduces the energy required for acoustic scoring by three orders of magnitude compared to solutions based on CPUs and GPUs. Our accelerator implements a lazy evaluation scheme where Gaussians are computed on demand, avoiding 50% of the computations. Furthermore, it employs a novel clustering scheme to reduce the size of the acoustic model, which results in 8x memory bandwidth savings with a negligible impact on accuracy. Finally, it includes a novel memoization scheme that avoids 74.88% of floating-point operations. The end design provides a 164x speedup and 3532x energy reduction when compared with a highly-tuned implementation running on a modern mobile CPU. Compared to a state-of-the-art mobile GPU, the GMM accelerator achieves 5.89x speedup over a highly optimized CUDA implementation, while reducing energy by 241x.
"An Ultra Low-Power Hardware Accelerator for Acoustic Scoring in Speech Recognition." In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.11
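For reference, the acoustic score the abstract refers to is the log-likelihood of a feature vector under a GMM; with diagonal covariances it reduces to the log-sum-exp computed below. This is a plain scalar reference implementation with invented identifiers, shown only to make the computation concrete; the paper's contributions (lazy on-demand evaluation, acoustic-model clustering, memoization of repeated per-dimension terms) are hardware techniques layered on top of this computation and are only hinted at in the comments.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Reference acoustic score: log p(x) = logsumexp_k( log w_k + log N(x; mu_k, var_k) )
// for a GMM with diagonal covariances. The accelerator described in the paper
// evaluates Gaussians lazily (on demand) and memoizes repeated per-dimension
// terms; this scalar version simply computes everything.
struct Gaussian {
    double log_weight;                 // log w_k
    std::vector<double> mean, var;     // mu_k[d] and sigma_k^2[d] per feature dimension
};

double log_gaussian(const Gaussian& g, const std::vector<double>& x) {
    const double two_pi = 6.283185307179586;
    double acc = 0.0;
    for (size_t d = 0; d < x.size(); ++d) {
        double diff = x[d] - g.mean[d];
        acc += std::log(two_pi * g.var[d]) + diff * diff / g.var[d];
    }
    return -0.5 * acc;
}

double acoustic_score(const std::vector<Gaussian>& gmm, const std::vector<double>& x) {
    std::vector<double> terms;
    terms.reserve(gmm.size());
    double best = -1e300;
    for (const Gaussian& g : gmm) {
        terms.push_back(g.log_weight + log_gaussian(g, x));
        best = std::max(best, terms.back());
    }
    double sum = 0.0;                  // log-sum-exp, stabilized by the maximum term
    for (double t : terms) sum += std::exp(t - best);
    return best + std::log(sum);
}
```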
The looming breakdown of Moore's Law and the end of voltage scaling are ushering in a new era where neither transistors nor the energy to operate them is free. This calls for a new regime in computer systems, one in which every transistor counts. Caches are essential for processor performance and represent the bulk of a modern processor's transistor budget. To get more performance out of the cache hierarchy, future processors will rely on effective cache management policies. This paper identifies variability in the generational behavior of cache blocks as a key challenge for cache management policies that aim to identify dead blocks as early and as accurately as possible to maximize cache efficiency. We show that existing management policies are limited by the metrics they use to identify dead blocks, leading to low coverage and/or low accuracy in the face of variability. In response, we introduce a new metric – Live Distance – that uses the stack distance to learn the temporal reuse characteristics of cache blocks, thus enabling a dead block predictor that is robust to variability in generational behavior. Based on the reuse characteristics of an application's cache blocks, our predictor – Leeway – classifies the application's behavior as streaming-oriented or reuse-oriented and dynamically selects an appropriate cache management policy. By leveraging live distance for LLC management, Leeway outperforms state-of-the-art approaches on single- and multi-core SPEC and manycore CloudSuite workloads.
"Leeway: Addressing Variability in Dead-Block Prediction for Last-Level Caches," by P. Faldu and Boris Grot. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.32
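The abstract leaves the precise definition of live distance to the paper; one way to picture the general idea is the toy model below, in which each cached block remembers the deepest LRU stack position at which it has ever been re-referenced, and a block that sinks deeper than that learned bound is predicted dead. This is an interpretation built for illustration, with an invented class and linear-time bookkeeping, not Leeway's actual predictor or update policy.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// Toy stack-distance-based dead-block heuristic: for each block, record the
// deepest LRU position at which it was ever reused ("live distance"); once a
// block sinks below its own recorded bound, predict it dead.
class LiveDistanceTracker {
public:
    // Process one access. Returns true if the access hit a block that was
    // currently predicted dead (useful for counting mispredictions).
    bool access(uint64_t block) {
        int depth = position_of(block);              // -1 if the block is not tracked yet
        bool was_predicted_dead = false;
        if (depth >= 0) {
            stack_.erase(stack_.begin() + depth);
            was_predicted_dead = depth > live_distance_[block];
            if (depth > live_distance_[block])
                live_distance_[block] = depth;       // learn the deeper reuse
        }
        stack_.push_front(block);                    // move/insert at the MRU position
        return was_predicted_dead;
    }

    bool predicted_dead(uint64_t block) const {
        int depth = position_of(block);
        auto it = live_distance_.find(block);
        return depth >= 0 && it != live_distance_.end() && depth > it->second;
    }

private:
    int position_of(uint64_t block) const {          // linear scan; fine for a toy model
        for (size_t i = 0; i < stack_.size(); ++i)
            if (stack_[i] == block) return (int)i;
        return -1;
    }
    std::deque<uint64_t> stack_;                     // MRU at the front
    std::unordered_map<uint64_t, int> live_distance_;// learned per-block bound (0 if new)
};
```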
Dimitrios Siakavaras, K. Nikas, G. Goumas, N. Koziris
In this paper we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in place. This allows threads that traverse the tree to proceed without any synchronization and without being affected by concurrent modifications. The novelty of RCU-HTM lies in leveraging HTM to permit multiple updating threads to execute concurrently. After appropriately modifying the private copy, we execute an HTM transaction, which atomically validates that all the affected parts of the tree have remained unchanged since they were read and, only if this validation is successful, installs the copy in the tree structure. We apply RCU-HTM to AVL and Red-Black balanced BSTs and compare their performance to state-of-the-art lock-based, non-blocking, RCU-based, and HTM-based BSTs. Our experimental evaluation reveals that BSTs implemented with RCU-HTM achieve high performance, not only for read-only operations, but also for update operations. More specifically, our evaluation covers a diverse range of tree sizes and operation workloads and reveals that BSTs based on RCU-HTM outperform the other alternatives by more than 18%, on average, on a multi-core server with 44 hardware threads.
"RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees." In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017. DOI: https://doi.org/10.1109/PACT.2017.17
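The install step described in the abstract, validating that the copied region is unchanged and then atomically swapping it in, maps naturally onto a small hardware transaction. The sketch below uses Intel RTM intrinsics to show the shape of that step for a single child pointer; the real algorithm copies and validates a whole affected subtree and has a more elaborate retry path, so treat the Node fields, the fallback lock, and the single-pointer check as placeholders rather than the paper's implementation.

```cpp
#include <immintrin.h>   // RTM intrinsics: _xbegin, _xend, _xabort (compile with -mrtm)
#include <mutex>

struct Node {
    int key;
    Node* left;
    Node* right;
};

std::mutex fallback_lock;   // simplistic fallback path; a placeholder, not the paper's scheme

// Try to atomically replace the link '*link' (e.g. parent->left) with the
// privately prepared copy 'new_copy', but only if the link still points to the
// node we copied from, i.e. no concurrent writer changed this part of the tree
// since we read it.
bool install_copy(Node** link, Node* old_child, Node* new_copy) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (*link != old_child)        // validation failed: the tree changed under us
            _xabort(0x01);             // abort; the caller retries from scratch
        *link = new_copy;              // installed atomically with the validation above
        _xend();
        return true;
    }
    // Transaction aborted or HTM unavailable: fall back to a coarse lock here.
    std::lock_guard<std::mutex> g(fallback_lock);
    if (*link != old_child) return false;
    *link = new_copy;
    return true;
}
```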