Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching
Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan, Viktor K. Prasanna
arXiv - CS - Operating Systems, arXiv:2401.06362, 2023-12-23
Abstract
Attention-based neural networks (NNs) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overhead of these models results in high inference latency, limiting their feasibility as practical prefetchers. To close this gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate, attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of this approach, we develop DART, a prefetcher composed of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model, accelerating inference of the large model by 170x and of the distilled model by 9.4x. DART has latency and storage costs comparable to those of the state-of-the-art rule-based prefetcher BO, yet surpasses it by 6.1% in IPC improvement, resulting in a 37.6% speed-up. DART also outperforms the state-of-the-art NN-based prefetchers TransFetch and Voyager by 33.1% and 37.2% in IPC improvement, respectively, primarily due to its low prefetching latency.
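
To make the core idea of "converting matrix multiplications into table lookups" concrete, the sketch below shows a product-quantization-style approximation of a single linear layer. This is an illustrative assumption of how tabularization can work in general, not DART's actual table hierarchy; all names (n_subspaces, n_centroids, tabular_forward, etc.) are hypothetical.

```python
# Minimal sketch (assumed, not the paper's implementation): approximate
# y = W @ x with per-subspace table lookups instead of a matrix multiply.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 64, 32            # layer dimensions (illustrative)
n_subspaces, n_centroids = 8, 16
sub_dim = d_in // n_subspaces

W = rng.standard_normal((d_out, d_in))
X_train = rng.standard_normal((1000, d_in))   # stand-in for training activations

# --- Offline: build a codebook per subspace and precompute lookup tables ---
centroids = []   # per-subspace codebooks, each (n_centroids, sub_dim)
tables = []      # per-subspace tables, each (n_centroids, d_out)
for s in range(n_subspaces):
    cols = slice(s * sub_dim, (s + 1) * sub_dim)
    sub = X_train[:, cols]
    # Crude codebook: sample training sub-vectors as centroids.
    # A real system would use k-means or a learned quantizer.
    C = sub[rng.choice(len(sub), n_centroids, replace=False)]
    centroids.append(C)
    # table[c] = W[:, cols] @ C[c]: the partial output contributed by centroid c.
    tables.append(C @ W[:, cols].T)

# --- Online: replace the matrix multiply with lookups and additions ---
def tabular_forward(x):
    y = np.zeros(d_out)
    for s in range(n_subspaces):
        cols = slice(s * sub_dim, (s + 1) * sub_dim)
        # Quantize the sub-vector to its nearest centroid, then look up
        # the precomputed partial result for that centroid.
        idx = np.argmin(np.linalg.norm(centroids[s] - x[cols], axis=1))
        y += tables[s][idx]
    return y

x = rng.standard_normal(d_in)
print("exact   :", (W @ x)[:4])
print("tabular :", tabular_forward(x)[:4])
```

The online path performs one nearest-centroid search and one table lookup per subspace, replacing the O(d_in * d_out) multiply-accumulates of the dense layer with a handful of additions, which is the kind of arithmetic reduction the abstract reports for DART.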