Haoran You, Cheng Wan, Yang Zhao, Zhongzhi Yu, Y. Fu, Jiayi Yuan, Shang Wu, Shunyao Zhang, Yongan Zhang, Chaojian Li, V. Boominathan, A. Veeraraghavan, Ziyun Li, Yingyan Lin
Eye tracking has become an essential human-machine interaction modality for providing immersive experiences in numerous virtual and augmented reality (VR/AR) applications that demand high throughput (e.g., 240 FPS), a small form-factor, and enhanced visual privacy. However, existing eye tracking systems are still limited by (1) their large form-factor, largely due to the bulky lens-based cameras they adopt; (2) the high communication cost between the camera and the backend processor; and (3) their low visual privacy, a growing concern, all of which prohibit their wider adoption. To this end, we propose, develop, and validate a lensless FlatCam-based eye tracking algorithm and accelerator co-design framework, dubbed EyeCoD, to enable eye tracking systems with a much-reduced form-factor and boosted system efficiency without sacrificing tracking accuracy, paving the way for next-generation eye tracking solutions. On the system level, we advocate the use of lensless FlatCams instead of lens-based cameras to meet the small form-factor requirement of mobile eye tracking systems, which also leaves room for a dedicated sensing-processor co-design that reduces the required camera-processor communication latency. On the algorithm level, EyeCoD integrates a predict-then-focus pipeline that first predicts the region-of-interest (ROI) via segmentation and then focuses only on the ROI to estimate gaze directions, greatly reducing redundant computations and data movements. On the hardware level, we further develop a dedicated accelerator that (1) integrates a novel workload orchestration between the aforementioned segmentation and gaze estimation models, (2) leverages intra-channel reuse opportunities for depth-wise layers, (3) utilizes input feature-wise partition to save activation memory size, and (4) develops a sequential-write-parallel-read input buffer to alleviate the bandwidth requirement of the activation global buffer. On-silicon measurements and extensive experiments validate that EyeCoD consistently reduces both communication and computation costs, leading to overall system speedups of 10.95×, 3.21×, and 12.85× over CPUs, GPUs, and a prior-art eye tracking processor (CIS-GEP), respectively, while maintaining tracking accuracy. Code is available at https://github.com/RICE-EIC/EyeCoD.
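As a rough illustration of the predict-then-focus idea, the sketch below first runs a hypothetical segmentation stand-in to obtain an ROI, then feeds only that crop to a hypothetical gaze-estimation stand-in; the function names, shapes, and dummy outputs are assumptions for illustration only, not EyeCoD's actual models or FlatCam reconstruction.

```python
import numpy as np

def segment_eye(frame):
    """Hypothetical stand-in for the segmentation model: returns a binary pupil/iris mask."""
    h, w = frame.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 3 : 2 * h // 3, w // 3 : 2 * w // 3] = True  # placeholder "predicted" region
    return mask

def roi_from_mask(mask, pad=8):
    """Bounding box of the predicted mask, padded and clipped to the frame."""
    ys, xs = np.nonzero(mask)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad + 1, mask.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad + 1, mask.shape[1])
    return y0, y1, x0, x1

def estimate_gaze(roi):
    """Hypothetical stand-in for the gaze-estimation model: maps the ROI crop to (yaw, pitch)."""
    return np.zeros(2)

frame = np.random.rand(240, 320)                  # one reconstructed eye frame (placeholder)
y0, y1, x0, x1 = roi_from_mask(segment_eye(frame))
gaze = estimate_gaze(frame[y0:y1, x0:x1])         # only the ROI reaches the second model
```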
{"title":"EyeCoD: eye tracking system acceleration via flatcam-based algorithm & accelerator co-design","authors":"Haoran You, Cheng Wan, Yang Zhao, Zhongzhi Yu, Y. Fu, Jiayi Yuan, Shang Wu, Shunyao Zhang, Yongan Zhang, Chaojian Li, V. Boominathan, A. Veeraraghavan, Ziyun Li, Yingyan Lin","doi":"10.1145/3470496.3527443","DOIUrl":"https://doi.org/10.1145/3470496.3527443","url":null,"abstract":"Eye tracking has become an essential human-machine interaction modality for providing immersive experience in numerous virtual and augmented reality (VR/AR) applications desiring high throughput (e.g., 240 FPS), small-form, and enhanced visual privacy. However, existing eye tracking systems are still limited by their: (1) large form-factor largely due to the adopted bulky lens-based cameras; (2) high communication cost required between the camera and backend processor; and (3) potentially concerned low visual privacy, thus prohibiting their more extensive applications. To this end, we propose, develop, and validate a lensless FlatCambased eye tracking algorithm and accelerator co-design framework dubbed EyeCoD to enable eye tracking systems with a much reduced form-factor and boosted system efficiency without sacrificing the tracking accuracy, paving the way for next-generation eye tracking solutions. On the system level, we advocate the use of lensless FlatCams instead of lens-based cameras to facilitate the small form-factor need in mobile eye tracking systems, which also leaves rooms for a dedicated sensing-processor co-design to reduce the required camera-processor communication latency. On the algorithm level, EyeCoD integrates a predict-then-focus pipeline that first predicts the region-of-interest (ROI) via segmentation and then only focuses on the ROI parts to estimate gaze directions, greatly reducing redundant computations and data movements. On the hardware level, we further develop a dedicated accelerator that (1) integrates a novel workload orchestration between the aforementioned segmentation and gaze estimation models, (2) leverages intra-channel reuse opportunities for depth-wise layers, (3) utilizes input feature-wise partition to save activation memory size, and (4) develops a sequential-write-parallel-read input buffer to alleviate the bandwidth requirement for the activation global buffer. On-silicon measurement and extensive experiments validate that our EyeCoD consistently reduces both the communication and computation costs, leading to an overall system speedup of 10.95×, 3.21×, and 12.85× over general computing platforms including CPUs and GPUs, and a prior-art eye tracking processor called CIS-GEP, respectively, while maintaining the tracking accuracy. Codes are available at https://github.com/RICE-EIC/EyeCoD.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"58 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124268525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, D. Novo, Juan G'omez-Luna, S. Stuijk, H. Corporaal, O. Mutlu
Hybrid storage systems (HSS) use multiple different storage devices to provide high and scalable storage capacity at high performance. Data placement across different devices is critical to maximize the benefits of such a hybrid system. Recent research proposes various techniques that aim to accurately identify performance-critical data to place it in a "best-fit" storage device. Unfortunately, most of these techniques are rigid, which (1) limits their adaptivity to perform well for a wide range of workloads and storage device configurations, and (2) makes it difficult for designers to extend these techniques to different storage system configurations (e.g., with a different number or different types of storage devices) than the configuration they are designed for. Our goal is to design a new data placement technique for hybrid storage systems that overcomes these issues and provides: (1) adaptivity, by continuously learning from and adapting to the workload and the storage device characteristics, and (2) easy extensibility to a wide range of workloads and HSS configurations. We introduce Sibyl, the first technique that uses reinforcement learning for data placement in hybrid storage systems. Sibyl observes different features of the running workload as well as the storage devices to make system-aware data placement decisions. For every decision it makes, Sibyl receives a reward from the system that it uses to evaluate the long-term performance impact of its decision and continuously optimizes its data placement policy online. We implement Sibyl on real systems with various HSS configurations, including dual- and tri-hybrid storage systems, and extensively compare it against four previously proposed data placement techniques (both heuristic- and machine learning-based) over a wide range of workloads. Our results show that Sibyl provides 21.6%/19.9% performance improvement in a performance-oriented/cost-oriented HSS configuration compared to the best previous data placement technique. Our evaluation using an HSS configuration with three different storage devices shows that Sibyl outperforms the state-of-the-art data placement policy by 23.9%-48.2%, while significantly reducing the system architect's burden in designing a data placement mechanism that can simultaneously incorporate three storage devices. We show that Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge of future access patterns while incurring a very modest storage overhead of only 124.4 KiB.
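The sketch below shows the reward-driven placement loop in the spirit of Sibyl: an epsilon-greedy agent picks a device per request based on workload features and updates its value estimates online from a latency-derived reward. The state features, device names, and update rule are illustrative assumptions, not the paper's actual formulation.

```python
import random
from collections import defaultdict

DEVICES = ["fast_ssd", "slow_hdd"]            # an example dual-hybrid configuration (assumption)

class PlacementAgent:
    """Toy epsilon-greedy, reward-driven placement policy; not Sibyl's actual design."""
    def __init__(self, epsilon=0.1, lr=0.1):
        self.q = defaultdict(lambda: {d: 0.0 for d in DEVICES})   # workload state -> device values
        self.epsilon, self.lr = epsilon, lr

    def place(self, state):
        if random.random() < self.epsilon:                        # explore occasionally
            return random.choice(DEVICES)
        return max(self.q[state], key=self.q[state].get)          # otherwise exploit

    def update(self, state, device, reward):
        # Online update: nudge the estimated value of the chosen device toward the observed reward.
        self.q[state][device] += self.lr * (reward - self.q[state][device])

agent = PlacementAgent()
state = ("hot", "4KiB")                          # e.g., recency and request-size features of the page
device = agent.place(state)
latency_us = 80 if device == "fast_ssd" else 900
agent.update(state, device, reward=-latency_us)  # lower observed latency => higher reward
```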
{"title":"Sibyl: adaptive and extensible data placement in hybrid storage systems using online reinforcement learning","authors":"Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, D. Novo, Juan G'omez-Luna, S. Stuijk, H. Corporaal, O. Mutlu","doi":"10.1145/3470496.3527442","DOIUrl":"https://doi.org/10.1145/3470496.3527442","url":null,"abstract":"Hybrid storage systems (HSS) use multiple different storage devices to provide high and scalable storage capacity at high performance. Data placement across different devices is critical to maximize the benefits of such a hybrid system. Recent research proposes various techniques that aim to accurately identify performance-critical data to place it in a \"best-fit\" storage device. Unfortunately, most of these techniques are rigid, which (1) limits their adaptivity to perform well for a wide range of workloads and storage device configurations, and (2) makes it difficult for designers to extend these techniques to different storage system configurations (e.g., with a different number or different types of storage devices) than the configuration they are designed for. Our goal is to design a new data placement technique for hybrid storage systems that overcomes these issues and provides: (1) adaptivity, by continuously learning from and adapting to the workload and the storage device characteristics, and (2) easy extensibility to a wide range of workloads and HSS configurations. We introduce Sibyl, the first technique that uses reinforcement learning for data placement in hybrid storage systems. Sibyl observes different features of the running workload as well as the storage devices to make system-aware data placement decisions. For every decision it makes, Sibyl receives a reward from the system that it uses to evaluate the long-term performance impact of its decision and continuously optimizes its data placement policy online. We implement Sibyl on real systems with various HSS configurations, including dual- and tri-hybrid storage systems, and extensively compare it against four previously proposed data placement techniques (both heuristic- and machine learning-based) over a wide range of workloads. Our results show that Sibyl provides 21.6%/19.9% performance improvement in a performance-oriented/cost-oriented HSS configuration compared to the best previous data placement technique. Our evaluation using an HSS configuration with three different storage devices shows that Sibyl outperforms the state-of-the-art data placement policy by 23.9%-48.2%, while significantly reducing the system architect's burden in designing a data placement mechanism that can simultaneously incorporate three storage devices. We show that Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge offuture access patterns while incurring a very modest storage overhead of only 124.4 KiB.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123235224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Z. Bingöl, G. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie S. Kim, Nika Mansouri-Ghiasi, Gagandeep Singh, Juan G'omez-Luna, N. Alserr, M. Alser, S. Subramoney, C. Alkan, Saugata Ghose, O. Mutlu
A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need for a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly-parallel memory access, which alleviates the memory bottleneck. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph (i.e., S2G) and sequence-to-sequence (i.e., S2S) mapping pipelines. First, SeGraM outperforms state-of-the-art S2G mapping tools by 5.9×/3.9× and 106×/742× for long and short reads, respectively, while reducing power consumption by 4.1×/4.4× and 3.0×/3.2×. Second, BitAlign outperforms a state-of-the-art S2G alignment tool by 41×-539× and three S2S alignment accelerators by 1.2×-4.8×. We conclude that SeGraM is a high-performance and low-cost universal genomics mapping accelerator that efficiently supports both sequence-to-graph and sequence-to-sequence mapping pipelines.
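For readers unfamiliar with minimizer-based seeding (the idea behind MinSeed), the sketch below computes classic (w, k)-minimizers in software: the lexicographically smallest k-mer in each window of w consecutive k-mers becomes a seed. This is a reference-level illustration of the seeding concept, not the accelerator's implementation.

```python
def minimizers(seq, k=5, w=4):
    """Classic (w, k)-minimizer seeding: keep the smallest k-mer in every window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    seeds = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        j = min(range(w), key=lambda t: window[t])   # position of the smallest k-mer in the window
        seeds.add((start + j, window[j]))            # (position in read, minimizer)
    return sorted(seeds)

# Only a handful of k-mers survive as seeds, which is what keeps candidate lookup cheap.
print(minimizers("ACGTACGTTGCAACGT"))
```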
{"title":"SeGraM: a universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping","authors":"Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Z. Bingöl, G. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie S. Kim, Nika Mansouri-Ghiasi, Gagandeep Singh, Juan G'omez-Luna, N. Alserr, M. Alser, S. Subramoney, C. Alkan, Saugata Ghose, O. Mutlu","doi":"10.1145/3470496.3527436","DOIUrl":"https://doi.org/10.1145/3470496.3527436","url":null,"abstract":"A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly-parallel memory access, which alleviates the memory bottleneck. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph (i.e., S2G) and sequence-to-sequence (i.e., S2S) mapping pipelines. First, SeGraM outperforms state-of-the-art S2G mapping tools by 5.9×/3.9× and 106×/- 742× for long and short reads, respectively, while reducing power consumption by 4.1×/4.4× and 3.0×/3.2×. Second, BitAlign outperforms a state-of-the-art S2G alignment tool by 41×-539× and three S2S alignment accelerators by 1.2×-4.8×. 
We conclude that SeGraM is a high-performance and low-cost universal genomics mapping accelerator that efficiently supports both sequence-to-graph and sequence-to-sequence mapping pipelines.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125534583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Personalized recommendation models (RecSys) are among the most popular machine learning workloads served by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirement, with model sizes reaching hundreds of GBs to TBs. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory-bandwidth-intensive operations, which is at odds with the slow CPU memory and causes performance overheads. Prior work proposed to cache frequently accessed embeddings inside GPU memory as a means to filter down the embedding-layer traffic to CPU memory, but this paper observes several limitations of such a cache design. In this work, we present a fundamentally different approach to designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that sees not only past but also "future" cache accesses. ScratchPipe exploits this property to guarantee that the active working set of the embedding layers can "always" be captured inside our proposed cache, enabling embedding-layer training to be conducted at GPU memory speed.
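The toy cache below conveys the "look forward" intuition: because the training schedule of embedding IDs is known ahead of time, the rows the next batch needs can be made resident before the GPU asks for them, and eviction can avoid rows that are about to be reused. The class, schedule, and eviction rule are illustrative assumptions, not ScratchPipe's actual architecture.

```python
from collections import OrderedDict

class LookaheadEmbeddingCache:
    """Toy cache that exploits a known schedule of upcoming embedding IDs."""
    def __init__(self, capacity, future_batches):
        self.capacity = capacity
        self.future = future_batches          # list of upcoming batches of embedding indices
        self.cache = OrderedDict()            # embedding id -> row resident in fast (GPU) memory

    def prefetch(self, step, cpu_table):
        """Before step+1 runs, pull every row it will need into the cache."""
        if step + 1 >= len(self.future):
            return
        needed = set(self.future[step + 1])
        for idx in needed:
            if idx not in self.cache:
                self._insert(idx, cpu_table[idx], keep=needed)

    def _insert(self, idx, row, keep):
        while len(self.cache) >= self.capacity:
            # Evict a row the next batch does not need; fall back to the oldest row otherwise.
            victim = next((i for i in self.cache if i not in keep), next(iter(self.cache)))
            self.cache.pop(victim)
        self.cache[idx] = row

schedule = [[3, 7, 7, 42], [7, 8, 9], [1, 3]]          # embedding IDs of each upcoming batch
table = {i: [0.0] * 4 for i in range(64)}              # stand-in for the CPU-resident table
cache = LookaheadEmbeddingCache(capacity=8, future_batches=schedule)
cache.prefetch(step=0, cpu_table=table)                # batch 1's rows (7, 8, 9) are now resident
```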
{"title":"Training personalized recommendation systems from (GPU) scratch: look forward not backwards","authors":"Youngeun Kwon, Minsoo Rhu","doi":"10.1145/3470496.3527386","DOIUrl":"https://doi.org/10.1145/3470496.3527386","url":null,"abstract":"Personalized recommendation models (RecSys) are one of the most popular machine learning workload serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size. In RecSys, the so-called embedding layers account for the majority of memory usage so current systems employ a hybrid CPU-GPU design to have the large CPU memory store the memory hungry embedding layers. Unfortunately, training embeddings involve several memory bandwidth intensive operations which is at odds with the slow CPU memory, causing performance overheads. Prior work proposed to cache frequently accessed embeddings inside GPU memory as means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such cache design. In this work, we present a fundamentally different approach in designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that not only sees the past but also the \"future\" cache accesses. ScratchPipe exploits such property to guarantee that the active working set of embedding layers can \"always\" be captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122941886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ only in the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have the potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD2, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD2 instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD2 instructions resemble a matrix-multiplication instruction, we are able to build the SIMD2 architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD2 using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD2 provides up to 38.59× speedup, and more than 6.94× on average, over optimized CUDA programs, with only 5% full-chip area overhead.
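The snippet below makes the semiring generalization concrete: the same tiled dot-product loop structure computes either an ordinary GEMM or an add-minimum (tropical) product, depending only on which element-wise "multiply" and reduction are plugged in. This is naive reference code to show the shared structure, not an accelerated SIMD2 kernel.

```python
import numpy as np

def semiring_matmul(A, B, mul=np.multiply, reduce=np.sum):
    """Generalized matrix product: swap the (x, +) pair for any element-wise op and reduction."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    out = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            out[i, j] = reduce(mul(A[i, :], B[:, j]))   # same loop nest, different core operation
    return out

A = np.array([[0.0, 3.0],
              [2.0, 0.0]])
gemm     = semiring_matmul(A, A)                              # ordinary matrix multiplication
min_plus = semiring_matmul(A, A, mul=np.add, reduce=np.min)   # add-minimum: 1-hop shortest paths
```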
{"title":"SIMD2: a generalized matrix instruction set for accelerating tensor computation beyond GEMM","authors":"Yunan Zhang, Po-An Tsai, Hung-Wei Tseng","doi":"10.1145/3470496.3527411","DOIUrl":"https://doi.org/10.1145/3470496.3527411","url":null,"abstract":"Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix-multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD2, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD2 instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD2 instructions resemble a matrix-multiplication instruction, we are able to build SIMD2 architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD2 using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD2 provides up to 38.59× speedup and more than 6.94× on average over optimized CUDA programs, with only 5% of full-chip area overhead.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130087576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DNA is emerging as an increasingly attractive medium for data storage due to a number of important and unique advantages it offers, most notably unprecedented durability and density. While the technology is evolving rapidly, the prohibitive cost of reads and writes, and the high frequency and peculiar nature of errors occurring in DNA storage, pose a significant challenge to its adoption. In this work we make the novel observation that the probability of successfully recovering a given bit from any type of DNA-based storage system depends heavily on its physical location within the DNA molecule. In other words, when used as a storage medium, some parts of DNA molecules appear significantly more reliable than others. We show that large differences in reliability between different parts of DNA molecules lead to highly inefficient use of error-correction resources, and that commonly used techniques such as unequal error correction cannot be used to bridge the reliability gap between different locations in the context of DNA storage. We then propose two approaches to address the problem. The first approach is general and applies to any type of data; it stripes the data and ECC codewords across DNA molecules in a particular fashion such that the effects of errors are spread out evenly across different codewords and molecules, effectively de-biasing the underlying storage medium and improving the resilience against losses of entire molecules. The second approach is application-specific, and seeks to leverage the underlying reliability bias by using application-aware mapping of data onto DNA molecules such that data that requires higher reliability is stored in more reliable locations, whereas data that tolerates lower reliability is stored in less reliable parts of DNA molecules. We show that the proposed data mapping can be used to achieve graceful degradation in the presence of high error rates, or to implement the concept of approximate storage in DNA. All proposed mechanisms are seamlessly integrated into the state-of-the-art DNA storage pipeline at zero storage overhead, validated through wet-lab experiments, and evaluated on end-to-end encrypted and compressed data.
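To illustrate the first (general) approach, the sketch below diagonally stripes codeword symbols across molecules so that every codeword samples every within-molecule position, and therefore every position-dependent error rate, roughly equally. The exact layout here is an assumption chosen for clarity, not the paper's precise striping scheme.

```python
def diagonal_stripe(codewords, num_molecules):
    """Toy de-biasing layout: symbol j of codeword i lands in molecule (i + j) % num_molecules
    at within-molecule position j, so no codeword is concentrated in unreliable positions."""
    length = len(codewords[0])
    molecules = [[None] * length for _ in range(num_molecules)]
    for i, cw in enumerate(codewords):
        for j, sym in enumerate(cw):
            molecules[(i + j) % num_molecules][j] = sym
    return molecules

codewords = [[f"c{i}s{j}" for j in range(4)] for i in range(4)]   # 4 ECC codewords of 4 symbols
for molecule in diagonal_stripe(codewords, num_molecules=4):
    print(molecule)   # each molecule holds one symbol of each codeword, at a different position
```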
{"title":"Managing reliability skew in DNA storage","authors":"Dehui Lin, Yasamin Tabatabaee, Yash Pote, Djordje Jevdjic","doi":"10.1145/3470496.3527441","DOIUrl":"https://doi.org/10.1145/3470496.3527441","url":null,"abstract":"DNA is emerging as an increasingly attractive medium for data storage due to a number of important and unique advantages it offers, most notably the unprecedented durability and density. While the technology is evolving rapidly, the prohibitive cost of reads and writes, the high frequency and the peculiar nature of errors occurring in DNA storage pose a significant challenge to its adoption. In this work we make a novel observation that the probability of successful recovery of a given bit from any type of a DNA-based storage system highly depends on its physical location within the DNA molecule. In other words, when used as a storage medium, some parts of DNA molecules appear significantly more reliable than others. We show that large differences in reliability between different parts of DNA molecules lead to highly inefficient use of error-correction resources, and that commonly used techniques such as unequal error-correction cannot be used to bridge the reliability gap between different locations in the context of DNA storage. We then propose two approaches to address the problem. The first approach is general and applies to any types of data; it stripes the data and ECC codewords across DNA molecules in a particular fashion such that the effects of errors are spread out evenly across different codewords and molecules, effectively de-biasing the underlying storage medium and improving the resilience against losses of entire molecules. The second approach is application-specific, and seeks to leverage the underlying reliability bias by using application-aware mapping of data onto DNA molecules such that data that requires higher reliability is stored in more reliable locations, whereas data that needs lower reliability is stored in less reliable parts of DNA molecules. We show that the proposed data mapping can be used to achieve graceful degradation in the presence of high error rates, or to implement the concept of approximate storage in DNA. All proposed mechanisms are seamlessly integrated into the state-of-the art DNA storage pipeline at zero storage overhead, validated through wetlab experiments, and evaluated on end-to-end encrypted and compressed data.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114640855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D perception in point clouds is transforming the perception ability of future intelligent machines. Point cloud algorithms, however, are plagued by irregular memory accesses, leading to massive inefficiencies in the memory sub-system that bottleneck overall efficiency. This paper proposes Crescent, an algorithm-hardware co-design system that tames the irregularities in deep point cloud analytics while achieving high accuracy. To that end, we introduce two approximation techniques, approximate neighbor search and selective bank-conflict elision, that "regularize" the DRAM and SRAM memory accesses. Doing so, however, necessarily introduces accuracy loss, which we mitigate with a new training procedure that integrates the approximations into the network training process. In essence, our training procedure trains models that are conditioned upon a specific approximation setting and thus retain high accuracy. Experiments show that Crescent doubles the performance and halves the energy consumption compared to an optimized baseline accelerator with < 1% accuracy loss. The code of our paper is available at: https://github.com/horizon-research/crescent.
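As a loose software analogy for why approximation helps regularize neighbor search, the sketch below buckets points into a voxel grid with a fixed per-cell capacity: overflowing points are dropped, trading a little accuracy for bounded, predictable bucket accesses. This is an illustrative stand-in under my own assumptions, not Crescent's approximate neighbor search or bank-conflict elision hardware.

```python
import numpy as np

def build_grid(points, cell, capacity=8):
    """Voxel grid with a hard per-cell capacity: lookups touch a bounded amount of data."""
    grid = {}
    for idx, p in enumerate(points):
        key = tuple(np.floor(p / cell).astype(int))
        bucket = grid.setdefault(key, [])
        if len(bucket) < capacity:        # drop overflow -> approximate but regular accesses
            bucket.append(idx)
    return grid

def approx_neighbors(grid, points, query, cell, radius):
    """Gather candidates from the 3x3x3 neighborhood of cells, then filter by true distance."""
    cx, cy, cz = np.floor(query / cell).astype(int)
    cand = [i for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
            for i in grid.get((cx + dx, cy + dy, cz + dz), [])]
    cand = np.array(cand, dtype=int)
    if cand.size == 0:
        return cand
    dist = np.linalg.norm(points[cand] - query, axis=1)
    return cand[dist <= radius]

points = np.random.rand(2048, 3).astype(np.float32)
grid = build_grid(points, cell=0.1)
print(approx_neighbors(grid, points, points[0], cell=0.1, radius=0.1))
```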
{"title":"Crescent: taming memory irregularities for accelerating deep point cloud analytics","authors":"Yu Feng, Gunnar Hammonds, Yiming Gan, Yuhao Zhu","doi":"10.1145/3470496.3527395","DOIUrl":"https://doi.org/10.1145/3470496.3527395","url":null,"abstract":"3D perception in point clouds is transforming the perception ability of future intelligent machines. Point cloud algorithms, however, are plagued by irregular memory accesses, leading to massive inefficiencies in the memory sub-system, which bottlenecks the overall efficiency. This paper proposes Crescent, an algorithm-hardware co-design system that tames the irregularities in deep point cloud analytics while achieving high accuracy. To that end, we introduce two approximation techniques, approximate neighbor search and selectively bank conflict elision, that \"regularize\" the DRAM and SRAM memory accesses. Doing so, however, necessarily introduces accuracy loss, which we mitigate by a new network training procedure that integrates approximation into the network training process. In essence, our training procedure trains models that are conditioned upon a specific approximate setting and, thus, retain a high accuracy. Experiments show that Crescent doubles the performance and halves the energy consumption compared to an optimized baseline accelerator with < 1% accuracy loss. The code of our paper is available at: https://github.com/horizon-research/crescent.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122267283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zheng Li, Soroush Ghodrati, A. Yazdanbakhsh, H. Esmaeilzadeh, Mingu Kang
Self-attention is a key enabler of state-of-the-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words correlates highly with the word under attention, and this subset is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggybacks on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision Transformer models. Post-layout results show that, on average, LeOPArd yields 1.9× and 3.9× speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (< 0.2% degradation).
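The sketch below shows threshold-based attention pruning at the functional level: keys whose raw score falls below a threshold are skipped before softmax, so their value rows never contribute. Here the threshold is a fixed input for illustration; in the paper it is learned jointly with the weights via the differentiable regularizer, and the hardware applies it with bit-level early termination.

```python
import numpy as np

def pruned_attention(q, K, V, threshold):
    """Single-query attention with runtime score pruning (functional sketch only)."""
    scores = K @ q / np.sqrt(q.shape[0])       # correlation of the query with every key
    keep = scores >= threshold                 # runtime pruning decision
    if not keep.any():
        keep[np.argmax(scores)] = True         # always keep at least the strongest key
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()                               # softmax over the surviving scores only
    return w @ V[keep]                         # pruned keys' value rows are never touched

d, n = 64, 128
q, K, V = np.random.randn(d), np.random.randn(n, d), np.random.randn(n, d)
out = pruned_attention(q, K, V, threshold=0.5)   # attention output for this query
```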
{"title":"Accelerating attention through gradient-based learned runtime pruning","authors":"Zheng Li, Soroush Ghodrati, A. Yazdanbakhsh, H. Esmaeilzadeh, Mingu Kang","doi":"10.1145/3470496.3527423","DOIUrl":"https://doi.org/10.1145/3470496.3527423","url":null,"abstract":"Self-attention is a key enabler of state-of-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words highly correlates with the word under attention, which is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggy backs on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with bit-level early termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision transformer models. Post-layout results show that, on average, LeOPArd yields 1.9×and 3.9×speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (< 0.2% degradation).","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126553002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ali Hadi Zadeh, M. Mahmoud, Ameer Abdelhadi, A. Moshovos
Increasingly larger and better Transformer models keep advancing state-of-the-art accuracy and capability for Natural Language Processing applications. These models demand more computational power, storage, and energy. Mokey reduces the footprint of state-of-the-art 32-bit or 16-bit floating-point transformer models by quantizing all values to 4-bit indexes into dictionaries of representative 16-bit fixed-point centroids. Mokey does not need fine-tuning, an essential feature as the training resources or datasets are often not available to many. Exploiting the range of values that naturally occurs in transformer models, Mokey selects centroid values that also fit an exponential curve. This unique feature enables Mokey to replace the bulk of the original multiply-accumulate operations with narrow 3-bit fixed-point additions, resulting in an area- and energy-efficient hardware accelerator design. Over a set of state-of-the-art transformer models, the Mokey accelerator delivers an order-of-magnitude improvement in energy efficiency over a Tensor Cores-based accelerator while improving performance by at least 4× and as much as 15×, depending on the model and on-chip buffering capacity. Optionally, Mokey can be used as a memory compression assist for any other accelerator, transparently stashing wide floating-point or fixed-point activations or weights into narrow 4-bit indexes. Mokey proves superior to prior state-of-the-art quantization methods for Transformers.
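The sketch below shows the dictionary-quantization step in isolation: a 16-entry centroid table is fit to the value distribution and every weight is replaced by the 4-bit index of its nearest centroid. Quantiles are used here as a simple stand-in for Mokey's exponential-curve centroid fit, and the fixed-point/addition-only compute path is not modeled.

```python
import numpy as np

def build_centroids(values, bits=4):
    """Build a 2^bits-entry dictionary from the value distribution (quantile-based stand-in)."""
    levels = 2 ** bits
    qs = np.linspace(0.5 / levels, 1.0 - 0.5 / levels, levels)
    return np.quantile(values, qs)

def quantize(values, centroids):
    """Replace each value with the index of its nearest centroid (a 4-bit code for bits=4)."""
    return np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1).astype(np.uint8)

weights = np.random.randn(4096).astype(np.float32)   # stand-in for a transformer weight tensor
centroids = build_centroids(weights)
codes = quantize(weights, centroids)                 # store 4-bit codes; dequantize as centroids[codes]
mean_err = np.abs(centroids[codes] - weights).mean() # rough reconstruction error of the dictionary
```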
{"title":"Mokey: enabling narrow fixed-point inference for out-of-the-box floating-point transformer models","authors":"Ali Hadi Zadeh, M. Mahmoud, Ameer Abdelhadi, A. Moshovos","doi":"10.1145/3470496.3527438","DOIUrl":"https://doi.org/10.1145/3470496.3527438","url":null,"abstract":"Increasingly larger and better Transformer models keep advancing state-of-the-art accuracy and capability for Natural Language Processing applications. These models demand more computational power, storage, and energy. Mokey reduces the footprint of state-of-the-art 32-bit or 16-bit floating-point transformer models by quantizing all values to 4-bit indexes into dictionaries of representative 16-bit fixed-point centroids. Mokey does not need fine-tuning, an essential feature as often the training resources or datasets are not available to many. Exploiting the range of values that naturally occur in transformer models, Mokey selects centroid values to also fit an exponential curve. This unique feature enables Mokey to replace the bulk of the original multiply-accumulate operations with narrow 3b fixed-point additions resulting in an area- and energy-efficient hardware accelerator design. Over a set of state-of-the-art transformer models, the Mokey accelerator delivers an order of magnitude improvements in energy efficiency over a Tensor Cores-based accelerator while improving performance by at least 4× and as much as 15× depending on the model and on-chip buffering capacity. Optionally, Mokey can be used as memory compression assist for any other accelerator transparently stashing wide floating-point or fixed-point activations or weights into narrow 4-bit indexes. Mokey proves superior to prior state-of-the-art quantization methods for Transformers.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132842198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Stein, Yufei Ding, N. Wiebe, Bo Peng, K. Kowalski, Nathan A. Baker, James Ang, A. Li
Variational quantum algorithms (VQAs), which comprise a classical optimizer and a parameterized quantum circuit, are emerging as one of the most promising approaches for harvesting the power of quantum computers in the noisy intermediate-scale quantum (NISQ) era. However, the deployment of VQAs on contemporary NISQ devices often faces considerable system and time-dependent noise and prohibitively slow training speeds. On the other hand, the expensive supporting resources and infrastructure put a premium on keeping quantum computers highly utilized. In this paper, we propose a virtualized way of building up a quantum backend for variational quantum algorithms: rather than relying on a single physical device, which tends to introduce ever-changing device-specific noise and less reliable performance as time-since-calibration grows, we propose to constitute a quantum ensemble, which dynamically distributes quantum tasks asynchronously across a set of physical devices and adjusts the ensemble configuration with respect to machine status. In addition to reducing machine-dependent noise, the ensemble can provide significant speedups for VQA training. With this idea, we build a novel VQA training framework called EQC - a distributed, gradient-based, processor-performance-aware optimization system - that comprises: (i) a system architecture for asynchronous parallel VQA cooperative training; (ii) an analytical model for assessing the quality of a circuit output with respect to its architecture, transpilation, and runtime conditions; and (iii) a weighting mechanism that adjusts each ensemble member's computational contribution according to the system's current performance. Evaluations comprising 500K circuit evaluations across 10 IBMQ NISQ devices, using VQE and QAOA applications, demonstrate that EQC can attain error rates very close to those of the most performant device in the ensemble, while boosting the training speed by 10.5× on average (up to 86× and at least 5.2×). EQC is available at https://github.com/pnnl/eqc.
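The sketch below illustrates the performance-aware weighting idea at the optimizer level: gradient estimates returned asynchronously by different backends are combined with weights proportional to a per-device quality score before one parameter update. The quality scores, gradients, and update rule here are illustrative assumptions, not EQC's analytical model or scheduler.

```python
import numpy as np

def weighted_aggregate(gradients, quality_scores):
    """Combine per-device gradient estimates, weighting each backend by its quality score."""
    w = np.asarray(quality_scores, dtype=float)
    w = w / w.sum()                                   # normalize contributions across the ensemble
    return sum(wi * gi for wi, gi in zip(w, gradients))

grads = [np.array([0.12, -0.30]),            # gradient estimate from device A
         np.array([0.18, -0.22]),            # ... device B
         np.array([0.05, -0.41])]            # ... device C (noisiest)
scores = [0.95, 0.90, 0.70]                  # better-calibrated devices contribute more
theta = np.array([0.30, 1.10])               # circuit parameters
theta = theta - 0.1 * weighted_aggregate(grads, scores)   # one gradient step of the VQA
```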
{"title":"EQC: ensembled quantum computing for variational quantum algorithms","authors":"S. Stein, Yufei Ding, N. Wiebe, Bo Peng, K. Kowalski, Nathan A. Baker, James Ang, A. Li","doi":"10.1145/3470496.3527434","DOIUrl":"https://doi.org/10.1145/3470496.3527434","url":null,"abstract":"Variational quantum algorithm (VQA), which is comprised of a classical optimizer and a parameterized quantum circuit, emerges as one of the most promising approaches for harvesting the power of quantum computers in the noisy intermediate scale quantum (NISQ) era. However, the deployment of VQAs on contemporary NISQ devices often faces considerable system and time-dependant noise and prohibitively slow training speeds. On the other hand, the expensive supporting resources and infrastructure make quantum computers extremely keen on high utilization. In this paper, we propose a virtualized way of building up a quantum backend for variational quantum algorithms: rather than relying on a single physical device which tends to introduce ever-changing device-specific noise with less reliable performance as time-since-calibration grows, we propose to constitute a quantum ensemble, which dynamically distributes quantum tasks asynchronously across a set of physical devices, and adjusts the ensemble configuration with respect to machine status. In addition to reduced machine-dependant noise, the ensemble can provide significant speedups for VQA training. With this idea, we build a novel VQA training framework called EQC - a distributed gradient-based processor-performance-aware optimization system - that comprises: (i) a system architecture for asynchronous parallel VQA cooperative training; (ii) an analytical model for assessing the quality of a circuit output concerning its architecture, transpilation, and runtime conditions; (iii) a weighting mechanism to adjust the quantum ensemble's computational contribution according to the systems' current performance. Evaluations comprising 500K times' circuit evaluations across 10 IBMQ NISQ devices using a VQE and a QAOA applications demonstrate that EQC can attain error rates very close to the most performant device of the ensemble, while boosting the training speed by 10.5X on average (up to 86X and at least 5.2x). EQC is available at https://github.com/pnnl/eqc.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129885314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}