Zhuowen Zou, Hanning Chen, Prathyush Poduval, Yeseong Kim, Mahdi Imani, Elaheh Sadredini, Rosario Cammarota, M. Imani
In this paper, we propose BioHD, a novel genomic sequence searching platform based on Hyper-Dimensional Computing (HDC) for hardware-friendly computation. BioHD transforms the inherently sequential process of genome matching into highly parallelizable computation tasks. We exploit HDC memorization to encode and represent genome sequences using high-dimensional vectors. BioHD then combines the genome sequences to generate an HDC reference library. During sequence searching, BioHD performs an exact or approximate similarity check of an encoded query against the HDC reference library. Our framework simplifies the required sequence matching operations while introducing a statistical model to control alignment quality. To take real advantage of BioHD's inherent robustness and parallelism, we design a processing in-memory (PIM) architecture with massive parallelism that is compatible with existing crossbar memory. Our PIM architecture supports all essential BioHD operations natively in memory with minimal modification to the array. We evaluate BioHD's accuracy and efficiency on a wide range of genomics data, including COVID-19 databases. Our results indicate that PIM provides 102.8× and 116.1× (9.3× and 13.2×) speedup and energy efficiency compared to the state-of-the-art pattern matching algorithm running on a GeForce RTX 3060 Ti GPU (a state-of-the-art PIM accelerator).
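To make the HDC idea concrete, here is a minimal sketch of hypervector encoding and similarity search for DNA sequences. The dimensionality, the shift-based position binding, and the cosine similarity metric are common HDC choices and are assumptions for illustration, not BioHD's exact encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hypervector dimensionality (illustrative choice)

# Random bipolar base hypervectors for the four nucleotides.
base = {ch: rng.choice([-1, 1], size=D) for ch in "ACGT"}

def encode(seq):
    """Encode a sequence by binding each base with a positional shift
    (np.roll) and bundling (summing) the results."""
    acc = np.zeros(D)
    for i, ch in enumerate(seq):
        acc += np.roll(base[ch], i)  # bind symbol with its position
    return acc

def similarity(a, b):
    """Cosine similarity between two encoded sequences."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

ref = encode("ACGTACGTAC")
close = encode("ACGTACGTAG")  # one mismatch: similarity stays high
far = encode("TTTTGGGGCC")    # unrelated: similarity near zero
```

Because similarity is just a dot product over independent dimensions, the search is embarrassingly parallel, which is what makes the computation amenable to a crossbar PIM substrate.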
BioHD: an efficient genome sequence search platform using HyperDimensional memorization. DOI: 10.1145/3470496.3527422. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Moinuddin K. Qureshi, Aditya Rohan, Gururaj Saileshwar, Prashant J. Nair
DRAM systems continue to be plagued by the Row-Hammer (RH) security vulnerability. The threshold number of row activations (TRH) required to induce RH has reduced rapidly from 139K in 2014 to 4.8K in 2020, and TRH is expected to reduce further, making RH even more severe for future DRAM. Therefore, solutions for mitigating RH should be effective not only at current TRH but also at future TRH. In this paper, we investigate the mitigation of RH at ultra-low thresholds (500 and below). At such thresholds, state-of-the-art solutions, which rely on SRAM or CAM for tracking row activations, incur impractical storage overheads (340KB or more per rank at TRH of 500), making such solutions unappealing for commercial adoption. Alternative solutions, which store per-row metadata in the addressable DRAM space, incur significant slowdown (25% on average) due to extra memory accesses, even in the presence of metadata caches. Our goal is to develop scalable RH mitigation while incurring low SRAM and performance overheads. To that end, this paper proposes Hydra, a Hybrid Tracker for RH mitigation, which combines the best of both SRAM and DRAM to enable low-cost mitigation of RH at ultra-low thresholds. Hydra consists of two structures. First, an SRAM-based structure that tracks aggregated counts at the granularity of a group of rows, and is sufficient for the vast majority of rows that receive only a few activations. Second, a per-row tracker stored in the DRAM array, which can track an arbitrary number of rows; however, to limit performance overheads, this tracker is used only for the small number of rows that exceed the tracking capability of the SRAM-based structure. We provide a security analysis of Hydra to show that Hydra can reliably issue a mitigation within the specified threshold. Our evaluations show that Hydra enables robust mitigation of RH, while incurring an SRAM overhead of only 28 KB per rank and an average slowdown of only 0.7% (at TRH of 500).
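The two-level tracking idea can be sketched in a few lines. This toy model follows the spirit of Hydra's design (aggregated group counters that escalate to per-row counters once a group gets hot); the group size, escalation threshold, and conservative initialization of per-row counts are illustrative assumptions, not Hydra's actual parameters.

```python
# Toy hybrid row-activation tracker: group counters stand in for the
# SRAM structure, per-row counters for the DRAM-resident tracker.
GROUP_SIZE = 8          # rows per aggregated counter (assumed)
GROUP_THRESHOLD = 200   # escalate a group to per-row tracking (assumed)
TRH = 500               # issue a mitigation at this activation count

group_count = {}        # "SRAM": group id -> aggregated activations
row_count = {}          # "DRAM": row id -> activations (hot groups only)
hot_groups = set()
mitigated = []

def activate(row):
    g = row // GROUP_SIZE
    if g in hot_groups:
        # Per-row counts start at the group threshold so the tracker
        # never undercounts a row's activations (conservative).
        row_count[row] = row_count.get(row, GROUP_THRESHOLD) + 1
        if row_count[row] >= TRH:
            mitigated.append(row)
            row_count[row] = 0
    else:
        group_count[g] = group_count.get(g, 0) + 1
        if group_count[g] >= GROUP_THRESHOLD:
            hot_groups.add(g)  # escalate: track this group's rows per-row

for _ in range(600):
    activate(7)        # hammered row: crosses TRH, gets mitigated
for _ in range(100):
    activate(3000)     # lightly used row: stays in the cheap group counter
```

Most rows never leave the aggregated counters, which is why the SRAM footprint stays small while hammered rows are still caught.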
Hydra: enabling low-overhead mitigation of row-hammer at ultra-low thresholds via hybrid tracking. DOI: 10.1145/3470496.3527421. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Zhangxiaowen Gong, Houxiang Ji, Yao Yao, Christopher W. Fletcher, C. Hughes, J. Torrellas
Graph Neural Networks (GNNs) are becoming popular because they are effective at extracting information from graphs. To execute GNNs, CPUs are good platforms because of their high availability and terabyte-level memory capacity, which enables full-batch computation on large graphs. However, GNNs on CPUs are heavily memory bound, which limits their performance. In this paper, we address this problem by alleviating the stress of GNNs on memory with cooperative software-hardware techniques. Our software techniques include: (i) layer fusion that overlaps the memory-intensive phase and the compute-intensive phase in a GNN layer, (ii) feature compression that reduces memory traffic by exploiting the sparsity in the vertex feature vectors, and (iii) an algorithm that changes the processing order of vertices to improve temporal locality. On top of the software techniques, we enhance the CPUs' direct memory access (DMA) engines with the capability to execute the GNNs' memory-intensive phase, so that the processor cores can focus on the compute-intensive phase. We call the combination of our software and hardware techniques Graphite. We evaluate Graphite with popular GNN models on large graphs. The result is high-performance full-batch GNN training and inference on CPUs. Our software techniques outperform a state-of-the-art GNN layer implementation by 1.7--1.9x in inference and 1.6--2.6x in training. Our combined software and hardware techniques speed up inference by 1.6--2.0x and training by 1.9--3.1x.
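The feature-compression idea, reducing memory traffic by exploiting sparsity in vertex feature vectors, can be illustrated with a minimal sketch. The compressed (indices, values) layout and the function names are assumptions for illustration; Graphite's actual format is a hardware-oriented design.

```python
import numpy as np

def compress(x):
    """Store only the nonzeros of a feature vector as (indices, values)."""
    idx = np.nonzero(x)[0]
    return idx, x[idx]

def aggregate(neighbors, compressed, dim):
    """Neighbor aggregation (the memory-intensive GNN phase) that moves
    and adds only nonzero entries instead of full dense vectors."""
    out = np.zeros(dim)
    for v in neighbors:
        idx, vals = compressed[v]
        out[idx] += vals  # touch only the nonzero positions
    return out

features = {
    0: np.array([0.0, 2.0, 0.0, 0.0]),
    1: np.array([1.0, 0.0, 0.0, 3.0]),
}
compressed = {v: compress(x) for v, x in features.items()}
agg = aggregate([0, 1], compressed, dim=4)
```

When feature vectors are mostly zeros, this cuts the bytes read per aggregated neighbor roughly in proportion to the sparsity, which is exactly the memory-traffic reduction the abstract describes.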
Graphite: optimizing graph neural networks on CPUs through cooperative software-hardware techniques. DOI: 10.1145/3470496.3527403. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Nikola Samardzic, Axel Feldmann, A. Krastev, Nathan Manohar, N. Genise, S. Devadas, Karim M. El Defrawy, Chris Peikert, Daniel Sánchez
Fully Homomorphic Encryption (FHE) enables offloading computation to untrusted servers with cryptographic privacy. Despite its attractive security, FHE is not yet widely adopted due to its prohibitive overheads, about 10,000X over unencrypted computation. Recent FHE accelerators have made strides to bridge this performance gap. Unfortunately, prior accelerators only work well for simple programs, but become inefficient for complex programs, which bring additional costs and challenges. We present CraterLake, the first FHE accelerator that enables FHE programs of unbounded size (i.e., unbounded multiplicative depth). Such computations require very large ciphertexts (tens of MBs each) and different algorithms that prior work does not support well. To tackle this challenge, CraterLake introduces a new hardware architecture that efficiently scales to very large ciphertexts, novel functional units to accelerate key kernels, and new algorithms and compiler techniques to reduce data movement. We evaluate CraterLake on deep FHE programs, including deep neural networks like ResNet and LSTMs, where prior work takes minutes to hours per inference on a CPU. CraterLake outperforms a CPU by gmean 4,600X and the best prior FHE accelerator by 11.2X under similar area and power budgets. These speeds enable real-time performance on unbounded FHE programs for the first time.
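The phrase "unbounded multiplicative depth" is the crux here. A toy noise model shows why: in levelled FHE schemes, each homomorphic multiplication grows ciphertext noise, and once the noise would exceed the budget the ciphertext must be bootstrapped (noise reset, at high cost). The squaring growth rule and the numeric constants below are purely illustrative, not CKKS/BGV parameters.

```python
# Toy FHE noise-budget model: deep circuits force periodic bootstrapping,
# so an accelerator for unbounded programs must make bootstrapping fast.
NOISE_BUDGET = 2 ** 20  # illustrative noise ceiling
FRESH_NOISE = 2         # illustrative noise of a fresh/bootstrapped ciphertext

def run_depth(depth):
    """Count how many bootstraps a multiplication chain of the given
    depth would need under this toy growth model."""
    noise, bootstraps = FRESH_NOISE, 0
    for _ in range(depth):
        if noise * noise > NOISE_BUDGET:  # next multiply would overflow
            noise, bootstraps = FRESH_NOISE, bootstraps + 1
        noise = noise * noise             # multiplication grows noise
    return bootstraps
```

A shallow circuit needs no bootstraps, while a deep one (like a ResNet inference) needs them repeatedly, which is why prior accelerators that neglect bootstrapping-heavy workloads fall behind.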
CraterLake: a hardware accelerator for efficient unbounded computation on encrypted data. DOI: 10.1145/3470496.3527393. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Siying Feng, Xin He, Kuan-Yu Chen, Liu Ke, Xuan Zhang, D. Blaauw, T. Mudge, R. Dreslinski
Near-memory processing has been extensively studied to optimize memory intensive workloads. However, none of the proposed designs address sparse matrix transposition, an important building block in sparse linear algebra applications. Prior work shows that sparse matrix transposition does not scale as well as other sparse primitives such as sparse matrix vector multiplication (SpMV) and hence has become a growing bottleneck in common applications. Sparse matrix transposition is highly memory intensive but low in computational intensity, making it a promising candidate for near-memory processing. In this work, we propose MeNDA, a scalable near-DRAM multi-way merge accelerator that eliminates the off-chip memory interface bottleneck and exposes the high internal memory bandwidth to improve performance and reduce energy consumption for sparse matrix transposition. MeNDA adopts a merge sort based algorithm, exploiting spatial locality, and proposes a near-memory processing unit (PU) featuring a high-performance hardware merge tree. Because of the wide application of merge sort in sparse linear algebra, MeNDA is an extensible solution that can be easily adapted to support other sparse primitives such as SpMV. Techniques including seamless back-to-back merge sort, stall-reducing prefetching, and request coalescing are further explored to take full advantage of the increased system memory bandwidth. Compared to two state-of-the-art implementations of sparse matrix transposition on a CPU and a sparse library on a GPU, MeNDA is able to achieve a speedup of 19.1x, 12.0x, and 7.7x, respectively. MeNDA also shows an efficiency gain of 3.8x over a recent SpMV accelerator integrated with HBM. Incurring a power consumption of only 78.6 mW, a MeNDA PU can be easily accommodated by commodity DIMMs.
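The merge-based view of transposition is easy to see in software: each row of a CSR matrix is already a run of entries sorted by column, so a multi-way merge of those runs emits the entries in column-major (transposed) order. The sketch below uses Python's `heapq.merge` as a stand-in for MeNDA's hardware merge tree; the tuple layout is an illustrative choice.

```python
import heapq

def transpose_merge(rows):
    """rows: list of rows, each a list of (col, val) pairs sorted by col.
    Returns all entries as (col, row, val) tuples in column-major order,
    i.e., the CSC/transposed ordering, via a multi-way merge."""
    runs = [[(col, r, val) for col, val in entries]
            for r, entries in enumerate(rows)]
    return list(heapq.merge(*runs))

csr = [
    [(0, 1.0), (2, 3.0)],   # row 0
    [(1, 4.0)],             # row 1
    [(0, 5.0), (2, 6.0)],   # row 2
]
transposed = transpose_merge(csr)
```

Each run is consumed strictly in order, so the access pattern is sequential per row, which is the spatial locality the abstract says MeNDA exploits.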
MeNDA: a near-memory multi-way merge solution for sparse transposition and dataflows. DOI: 10.1145/3470496.3527432. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Alexandar Devic, S. Rai, A. Sivasubramaniam, Ameen Akel, S. Eilert, Justin Eno
As Processing-In-Memory (PIM) hardware matures and starts making its way into normal compute platforms, software has an important role to play in determining what to perform where, and when, on such heterogeneous systems. Taking an emerging class of PIM hardware which provisions a general purpose (RISC-V) processor at each memory bank, this paper takes on this challenging problem by developing a software compilation framework. This framework analyzes several application characteristics - parallelizability, vectorizability, data set sizes, and offload costs - to determine what, whether, when and how to offload computations to the PIM engines. In the process, it also proposes a vector engine extension to the bank-level RISC-V cores. Using several off-the-shelf C/C++ applications, we demonstrate that PIM is not always a panacea, and a framework such as ours is essential in carefully selecting what needs to be performed where, when and how. The choice of hardware platforms - number of memory banks, relative speeds and capabilities of host CPU and PIM cores, can further impact the "to PIM or not" question.
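The "to PIM or not" decision the paper's framework makes can be caricatured as a roofline-style cost comparison: offload a region only when the estimated PIM time (slower cores, but internal bandwidth and no off-chip traffic) beats host time plus the offload overhead. Every parameter below is an illustrative assumption, not a measurement from the paper.

```python
def should_offload(bytes_moved, flops,
                   host_gflops=100.0,        # assumed host compute rate
                   pim_gflops_total=20.0,    # assumed aggregate PIM compute
                   mem_bw_gb=25.0,           # assumed off-chip DRAM bandwidth
                   pim_internal_bw_gb=200.0, # assumed in-bank bandwidth
                   offload_overhead_s=1e-4): # assumed launch/copy cost
    """Crude roofline estimate: each side is limited by the slower of
    its compute and memory times; offload only if PIM wins overall."""
    host_time = max(flops / (host_gflops * 1e9),
                    bytes_moved / (mem_bw_gb * 1e9))
    pim_time = max(flops / (pim_gflops_total * 1e9),
                   bytes_moved / (pim_internal_bw_gb * 1e9)) + offload_overhead_s
    return pim_time < host_time

# A streaming, memory-bound region favors PIM; a compute-heavy one does not.
stream_wins = should_offload(bytes_moved=1e9, flops=1e8)
compute_stays = should_offload(bytes_moved=1e6, flops=1e11)
```

This also shows why the abstract's closing point holds: changing the bank count or relative core speeds shifts these parameters and can flip the decision for the same code region.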
To PIM or not for emerging general purpose processing in DDR memory systems. DOI: 10.1145/3470496.3527431. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Private Information Retrieval (PIR) plays a vital role in secure, database-centric applications. However, existing PIR protocols explore a massive working space containing hundreds of GiBs of query and database data. As a consequence, PIR performance is severely bounded by storage communication, making it far from practical for real-world deployment. In this work, we describe INSPIRE, an accelerator for IN-Storage Private Information REtrieval. INSPIRE follows a protocol and architecture co-design approach. We first design the INSPIRE protocol with a multi-stage filtering mechanism, which achieves a constant PIR query size. For a 1-billion-entry database of size 288GiB, INSPIRE's protocol reduces the query size from 27GiB to 3.6MiB. Further, we propose the INSPIRE hardware, a heterogeneous in-storage architecture, which integrates our protocol across the SSD hierarchy. Together with the INSPIRE protocol, the INSPIRE hardware reduces the query time from 28.4min to 36s, relative to the state-of-the-art FastPIR scheme.
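For readers new to PIR, the classic two-server XOR scheme (shown below, and emphatically not INSPIRE's protocol) illustrates both the privacy property and the baseline problem: the client's query is a mask as long as the database, which is exactly the query-size blowup that INSPIRE's constant-size queries avoid.

```python
import secrets

def make_queries(n, i):
    """Split index i into two masks: each server sees a uniformly random
    bit vector, so neither learns which record is wanted."""
    q_a = [secrets.randbits(1) for _ in range(n)]
    q_b = q_a.copy()
    q_b[i] ^= 1  # the masks differ only at the wanted index
    return q_a, q_b

def answer(db, q):
    """Each server returns the XOR of the records its mask selects."""
    acc = 0
    for rec, bit in zip(db, q):
        if bit:
            acc ^= rec
    return acc

db = [0x11, 0x22, 0x33, 0x44]
q_a, q_b = make_queries(len(db), 2)
recovered = answer(db, q_a) ^ answer(db, q_b)  # every term except db[2] cancels
```

Note the cost: each query here is one bit per database entry, so a billion-entry database means gigabit-scale queries, which motivates protocol redesigns like INSPIRE's multi-stage filtering.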
INSPIRE: in-storage private information retrieval via protocol and architecture co-design. Jilan Lin, Ling Liang, Zheng Qu, Ishtiyaque Ahmad, L. Liu, Fengbin Tu, Trinabh Gupta, Yufei Ding, Yuan Xie. DOI: 10.1145/3470496.3527433. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Yuanchao Xu, Chencheng Ye, Yan Solihin, Xipeng Shen
Persistent Memory (PM) is increasingly supplementing or substituting DRAM as main memory. Prior work has focused on reusability and memory leaks of persistent memory but has not addressed a problem amplified by persistence: persistent memory fragmentation, the continuous worsening of fragmentation of persistent memory throughout its usage. This paper reveals the challenges and proposes the first systematic crash-consistent solution, Fence-Free Crash-consistent Concurrent Defragmentation (FFCCD). FFCCD reuses the persistent pointer format, root nodes, and typed allocation provided by the persistent memory programming model to enable concurrent defragmentation on PM. FFCCD introduces architecture support for concurrent defragmentation that enables a fence-free design and a fast read barrier, reducing two major overheads of defragmenting persistent memory. The techniques are effective (28--73% fragmentation reduction) and fast (4.1% execution time overhead).
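A toy compaction pass with a read barrier shows the basic mechanism defragmentation relies on: live objects are copied into a compact region, a forwarding table records the moves, and every access is redirected through that table. FFCCD's contribution is doing this concurrently, crash-consistently, and fence-free with hardware support; this sketch (with invented names and a dict standing in for the heap) shows only the software-visible idea.

```python
# Heap modeled as address -> object; None marks freed slots whose holes
# cause fragmentation. Slot size of 16 bytes is an illustrative choice.
heap = {0: "a", 16: None, 32: "b", 48: None, 64: "c"}
forwarding = {}  # old address -> new address for relocated objects

def defragment():
    """Copy live objects to the lowest free addresses and record each
    relocation in the forwarding table."""
    new_addr = 0
    for addr in sorted(heap):
        if heap[addr] is not None:
            forwarding[addr] = new_addr
            new_addr += 16
    compacted = {forwarding[a]: v for a, v in heap.items() if v is not None}
    heap.clear()
    heap.update(compacted)

def read(addr):
    """Read barrier: follow the forwarding entry if the object moved."""
    return heap[forwarding.get(addr, addr)]

defragment()
```

The read barrier runs on every access, which is why making it fast (and avoiding ordering fences around forwarding updates) dominates the overhead of a real PM defragmenter.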
FFCCD. DOI: 10.1145/3470496.3527406. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-11.
Kevin Loughlin, S. Saroiu, A. Wolman, Yatin A. Manerkar, Baris Kasikci
Prior work shows that Rowhammer attacks---which flip bits in DRAM via frequent activations of the same row(s)---are viable. Adversaries typically mount these attacks via instruction sequences that are carefully-crafted to bypass CPU caches. However, we discover a novel form of hammering that we refer to as coherence-induced hammering, caused by Intel's implementations of cache coherent non-uniform memory access (ccNUMA) protocols. We show that this hammering occurs in commodity benchmarks on a major cloud provider's production hardware, the first hammering found to be generated by non-malicious code. Given DRAM's rising susceptibility to bit flips, it is paramount to prevent coherence-induced hammering to ensure reliability and security in the cloud. Accordingly, we introduce MOESI-prime, a ccNUMA coherence protocol that mitigates coherence-induced hammering while retaining Intel's state-of-the-art scalability. MOESI-prime shows that most DRAM reads and writes triggering such hammering are unnecessary. Thus, by encoding additional information in the coherence protocol, MOESI-prime can omit these reads and writes, preventing coherence-induced hammering in non-malicious and malicious workloads. Furthermore, by omitting unnecessary reads and writes, MOESI-prime has negligible effect on average performance (within ±0.61% of MESI and MOESI) and average DRAM power (0.03%-0.22% improvement) across evaluated ccNUMA configurations.
MOESI-prime: preventing coherence-induced hammering in commodity workloads. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-11. DOI: 10.1145/3470496.3527427.
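The mechanism behind the savings can be shown with a toy model: when two NUMA nodes ping-pong a cache line, a baseline directory that touches DRAM on every transfer activates the same row repeatedly (coherence-induced hammering), whereas supplying the line cache-to-cache and omitting the unnecessary DRAM access removes those activations entirely. This is my own illustrative simplification, not Intel's actual protocol or the paper's exact encoding.

```python
# Toy model of coherence-induced hammering. Each iteration, a cache line
# ping-pongs between two ccNUMA nodes. A baseline directory performs a
# DRAM read/write per transfer, activating the line's DRAM row each time;
# a MOESI-prime-style protocol recognizes these accesses as unnecessary
# and omits them, so the row is never activated by the ping-pong.

def simulate(transfers, omit_unneeded_dram_access):
    row_activations = 0
    for _ in range(transfers):
        if not omit_unneeded_dram_access:
            # Baseline: directory writes back / re-fetches via DRAM,
            # activating the same row on every transfer.
            row_activations += 1
        # Otherwise the line is supplied cache-to-cache; DRAM untouched.
    return row_activations

PING_PONGS = 100_000
baseline = simulate(PING_PONGS, omit_unneeded_dram_access=False)
prime = simulate(PING_PONGS, omit_unneeded_dram_access=True)
assert baseline == PING_PONGS and prime == 0
```

Since the omitted accesses were unnecessary to begin with, eliminating them is why the paper reports negligible performance impact alongside the security benefit.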
Nicholas Mosier, Hanna Lachnitt, Hamed Nemati, Caroline Trippel
We propose leakage containment models (LCMs)---novel axiomatic security contracts which support formally reasoning about the security guarantees of programs when they run on particular microarchitectures. Our core contribution is an axiomatic vocabulary for formalizing LCMs, derived from the established axiomatic vocabulary for formalizing processor memory consistency models. Using this vocabulary, we formalize microarchitectural leakage---focusing on leakage through hardware memory systems---so that it can be automatically detected in programs and provide a taxonomy for classifying said leakage by severity. To illustrate the efficacy of LCMs, we first demonstrate that our leakage definition faithfully captures a sampling of (transient and non-transient) microarchitectural attacks from the literature. Second, we develop a static analysis tool based on LCMs which automatically identifies Spectre vulnerabilities in programs and scales to analyze real-world crypto-libraries.
Axiomatic hardware-software contracts for security. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-11. DOI: 10.1145/3470496.3527412.
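The kind of memory-system leakage that LCMs formalize can be sketched with a toy taint checker: a secret value that flows into the *address* of a memory access becomes microarchitecturally observable (e.g., through which cache set it touches), which is the heart of a Spectre-v1 gadget. The checker below over a tiny straight-line IR is my own sketch, not the paper's axiomatic formalism or its static analysis tool.

```python
# Toy detector for secret-dependent memory addresses, the class of
# hardware-memory-system leakage discussed above. Instructions are
# (op, dst, srcs) tuples; a "load" whose address operand is tainted is
# flagged as a cache-visible transmitter. Purely illustrative: LCMs are
# axiomatic (relational, over candidate executions), not a taint pass.

def find_leaks(program, secrets):
    tainted = set(secrets)
    leaks = []
    for i, (op, dst, srcs) in enumerate(program):
        if op == "load" and srcs[0] in tainted:
            # Address depends on a secret: the access's cache footprint
            # transmits information about that secret.
            leaks.append(i)
        if any(s in tainted for s in srcs):
            tainted.add(dst)  # taint propagates through data flow
    return leaks

# Spectre-v1-like gadget: a secret-derived index feeds a second load.
gadget = [
    ("mul",  "a", ["secret"]),  # a <- secret * stride (tainted address)
    ("load", "y", ["a"]),       # y <- mem[a]: transmits secret via cache
]
assert find_leaks(gadget, {"secret"}) == [1]
```

A severity taxonomy like the paper's could then grade each flagged access, e.g. by whether it occurs only on transiently executed paths.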