Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching
Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan, Viktor K. Prasanna
arXiv - CS - Operating Systems, arXiv:2401.06362, 2023-12-23
Abstract
Attention-based neural networks (NNs) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overhead of these models results in high inference latency, limiting their feasibility as practical prefetchers. To close this gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate, attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of this approach, we develop DART, a prefetcher composed of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model, accelerating inference of the large model by 170x and of the distilled model by 9.4x. DART has latency and storage costs comparable to those of the state-of-the-art rule-based prefetcher BO, yet surpasses it by 6.1% in IPC improvement, resulting in a 37.6% speed-up. DART also outperforms the state-of-the-art NN-based prefetchers TransFetch and Voyager by 33.1% and 37.2% in IPC improvement, respectively, primarily due to its low prefetching latency.
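
To make the core idea of "converting matrix multiplications into table lookups" concrete, the sketch below shows a product-quantization-style approximation of a single linear layer. This is an illustrative assumption of how tabularization can work in general, not DART's actual table hierarchy; all names (n_subspaces, n_centroids, tabular_forward, etc.) are hypothetical.

```python
# Minimal sketch (assumed, not the paper's implementation): approximate
# y = W @ x with per-subspace table lookups instead of a matrix multiply.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 64, 32            # layer dimensions (illustrative)
n_subspaces, n_centroids = 8, 16
sub_dim = d_in // n_subspaces

W = rng.standard_normal((d_out, d_in))
X_train = rng.standard_normal((1000, d_in))   # stand-in for training activations

# --- Offline: build a codebook per subspace and precompute lookup tables ---
centroids = []   # per-subspace codebooks, each (n_centroids, sub_dim)
tables = []      # per-subspace tables, each (n_centroids, d_out)
for s in range(n_subspaces):
    cols = slice(s * sub_dim, (s + 1) * sub_dim)
    sub = X_train[:, cols]
    # Crude codebook: sample training sub-vectors as centroids.
    # A real system would use k-means or a learned quantizer.
    C = sub[rng.choice(len(sub), n_centroids, replace=False)]
    centroids.append(C)
    # table[c] = W[:, cols] @ C[c]: the partial output contributed by centroid c.
    tables.append(C @ W[:, cols].T)

# --- Online: replace the matrix multiply with lookups and additions ---
def tabular_forward(x):
    y = np.zeros(d_out)
    for s in range(n_subspaces):
        cols = slice(s * sub_dim, (s + 1) * sub_dim)
        # Quantize the sub-vector to its nearest centroid, then look up
        # the precomputed partial result for that centroid.
        idx = np.argmin(np.linalg.norm(centroids[s] - x[cols], axis=1))
        y += tables[s][idx]
    return y

x = rng.standard_normal(d_in)
print("exact   :", (W @ x)[:4])
print("tabular :", tabular_forward(x)[:4])
```

The online path performs one nearest-centroid search and one table lookup per subspace, replacing the O(d_in * d_out) multiply-accumulates of the dense layer with a handful of additions, which is the kind of arithmetic reduction the abstract reports for DART.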