Zhuowen Zou, Hanning Chen, Prathyush Poduval, Yeseong Kim, Mahdi Imani, Elaheh Sadredini, Rosario Cammarota, M. Imani
In this paper, we propose BioHD, a novel genomic sequence searching platform based on Hyper-Dimensional Computing (HDC) for hardware-friendly computation. BioHD transforms the inherently sequential process of genome matching into highly parallelizable computation tasks. We exploit HDC memorization to encode and represent genome sequences using high-dimensional vectors. BioHD then combines the genome sequences to generate an HDC reference library. During sequence searching, BioHD performs an exact or approximate similarity check of an encoded query against the HDC reference library. Our framework simplifies the required sequence matching operations while introducing a statistical model to control alignment quality. To take real advantage of BioHD's inherent robustness and parallelism, we design a processing in-memory (PIM) architecture with massive parallelism that is compatible with existing crossbar memory. Our PIM architecture supports all essential BioHD operations natively in memory with minimal modification to the array. We evaluate BioHD's accuracy and efficiency on a wide range of genomics data, including COVID-19 databases. Our results indicate that PIM provides 102.8× and 116.1× (9.3× and 13.2×) speedup and energy efficiency compared to the state-of-the-art pattern matching algorithm running on a GeForce RTX 3060 Ti GPU (a state-of-the-art PIM accelerator).
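To make the HDC idea concrete, here is a minimal sketch of hypervector encoding and similarity search for DNA sequences. The dimensionality, the shift-based position binding, and the cosine similarity metric are common HDC choices and are assumptions for illustration, not BioHD's exact encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hypervector dimensionality (illustrative choice)

# Random bipolar base hypervectors for the four nucleotides.
base = {ch: rng.choice([-1, 1], size=D) for ch in "ACGT"}

def encode(seq):
    """Encode a sequence by binding each base with a positional shift
    (np.roll) and bundling (summing) the results."""
    acc = np.zeros(D)
    for i, ch in enumerate(seq):
        acc += np.roll(base[ch], i)  # bind symbol with its position
    return acc

def similarity(a, b):
    """Cosine similarity between two encoded sequences."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

ref = encode("ACGTACGTAC")
close = encode("ACGTACGTAG")  # one mismatch: similarity stays high
far = encode("TTTTGGGGCC")    # unrelated: similarity near zero
```

Because similarity is just a dot product over independent dimensions, the search is embarrassingly parallel, which is what makes the computation amenable to a crossbar PIM substrate.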
BioHD: an efficient genome sequence search platform using HyperDimensional memorization. DOI: 10.1145/3470496.3527422. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Moinuddin K. Qureshi, Aditya Rohan, Gururaj Saileshwar, Prashant J. Nair
DRAM systems continue to be plagued by the Row-Hammer (RH) security vulnerability. The threshold number of row activations (TRH) required to induce RH has reduced rapidly from 139K in 2014 to 4.8K in 2020, and TRH is expected to reduce further, making RH even more severe for future DRAM. Therefore, solutions for mitigating RH should be effective not only at current TRH but also at future TRH. In this paper, we investigate the mitigation of RH at ultra-low thresholds (500 and below). At such thresholds, state-of-the-art solutions, which rely on SRAM or CAM for tracking row activations, incur impractical storage overheads (340KB or more per rank at TRH of 500), making such solutions unappealing for commercial adoption. Alternative solutions, which store per-row metadata in the addressable DRAM space, incur significant slowdown (25% on average) due to extra memory accesses, even in the presence of metadata caches. Our goal is to develop scalable RH mitigation while incurring low SRAM and performance overheads. To that end, this paper proposes Hydra, a Hybrid Tracker for RH mitigation, which combines the best of both SRAM and DRAM to enable low-cost mitigation of RH at ultra-low thresholds. Hydra consists of two structures. First, an SRAM-based structure that tracks aggregated counts at the granularity of a group of rows, and is sufficient for the vast majority of rows that receive only a few activations. Second, a per-row tracker stored in the DRAM array, which can track an arbitrary number of rows; however, to limit performance overheads, this tracker is used only for the small number of rows that exceed the tracking capability of the SRAM-based structure. We provide a security analysis of Hydra to show that Hydra can reliably issue a mitigation within the specified threshold. Our evaluations show that Hydra enables robust mitigation of RH, while incurring an SRAM overhead of only 28 KB per rank and an average slowdown of only 0.7% (at TRH of 500).
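The two-level tracking idea can be sketched in a few lines. This toy model follows the spirit of Hydra's design (aggregated group counters that escalate to per-row counters once a group gets hot); the group size, escalation threshold, and conservative initialization of per-row counts are illustrative assumptions, not Hydra's actual parameters.

```python
# Toy hybrid row-activation tracker: group counters stand in for the
# SRAM structure, per-row counters for the DRAM-resident tracker.
GROUP_SIZE = 8          # rows per aggregated counter (assumed)
GROUP_THRESHOLD = 200   # escalate a group to per-row tracking (assumed)
TRH = 500               # issue a mitigation at this activation count

group_count = {}        # "SRAM": group id -> aggregated activations
row_count = {}          # "DRAM": row id -> activations (hot groups only)
hot_groups = set()
mitigated = []

def activate(row):
    g = row // GROUP_SIZE
    if g in hot_groups:
        # Per-row counts start at the group threshold so the tracker
        # never undercounts a row's activations (conservative).
        row_count[row] = row_count.get(row, GROUP_THRESHOLD) + 1
        if row_count[row] >= TRH:
            mitigated.append(row)
            row_count[row] = 0
    else:
        group_count[g] = group_count.get(g, 0) + 1
        if group_count[g] >= GROUP_THRESHOLD:
            hot_groups.add(g)  # escalate: track this group's rows per-row

for _ in range(600):
    activate(7)        # hammered row: crosses TRH, gets mitigated
for _ in range(100):
    activate(3000)     # lightly used row: stays in the cheap group counter
```

Most rows never leave the aggregated counters, which is why the SRAM footprint stays small while hammered rows are still caught.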
Hydra: enabling low-overhead mitigation of row-hammer at ultra-low thresholds via hybrid tracking. DOI: 10.1145/3470496.3527421. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Zhangxiaowen Gong, Houxiang Ji, Yao Yao, Christopher W. Fletcher, C. Hughes, J. Torrellas
Graph Neural Networks (GNNs) are becoming popular because they are effective at extracting information from graphs. To execute GNNs, CPUs are good platforms because of their high availability and terabyte-level memory capacity, which enables full-batch computation on large graphs. However, GNNs on CPUs are heavily memory bound, which limits their performance. In this paper, we address this problem by alleviating the stress of GNNs on memory with cooperative software-hardware techniques. Our software techniques include: (i) layer fusion that overlaps the memory-intensive phase and the compute-intensive phase in a GNN layer, (ii) feature compression that reduces memory traffic by exploiting the sparsity in the vertex feature vectors, and (iii) an algorithm that changes the processing order of vertices to improve temporal locality. On top of the software techniques, we enhance the CPUs' direct memory access (DMA) engines with the capability to execute the GNNs' memory-intensive phase, so that the processor cores can focus on the compute-intensive phase. We call the combination of our software and hardware techniques Graphite. We evaluate Graphite with popular GNN models on large graphs. The result is high-performance full-batch GNN training and inference on CPUs. Our software techniques outperform a state-of-the-art GNN layer implementation by 1.7--1.9x in inference and 1.6--2.6x in training. Our combined software and hardware techniques speed up inference by 1.6--2.0x and training by 1.9--3.1x.
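The feature-compression idea, reducing memory traffic by exploiting sparsity in vertex feature vectors, can be illustrated with a minimal sketch. The compressed (indices, values) layout and the function names are assumptions for illustration; Graphite's actual format is a hardware-oriented design.

```python
import numpy as np

def compress(x):
    """Store only the nonzeros of a feature vector as (indices, values)."""
    idx = np.nonzero(x)[0]
    return idx, x[idx]

def aggregate(neighbors, compressed, dim):
    """Neighbor aggregation (the memory-intensive GNN phase) that moves
    and adds only nonzero entries instead of full dense vectors."""
    out = np.zeros(dim)
    for v in neighbors:
        idx, vals = compressed[v]
        out[idx] += vals  # touch only the nonzero positions
    return out

features = {
    0: np.array([0.0, 2.0, 0.0, 0.0]),
    1: np.array([1.0, 0.0, 0.0, 3.0]),
}
compressed = {v: compress(x) for v, x in features.items()}
agg = aggregate([0, 1], compressed, dim=4)
```

When feature vectors are mostly zeros, this cuts the bytes read per aggregated neighbor roughly in proportion to the sparsity, which is exactly the memory-traffic reduction the abstract describes.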
Graphite: optimizing graph neural networks on CPUs through cooperative software-hardware techniques. DOI: 10.1145/3470496.3527403. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Nikola Samardzic, Axel Feldmann, A. Krastev, Nathan Manohar, N. Genise, S. Devadas, Karim M. El Defrawy, Chris Peikert, Daniel Sánchez
Fully Homomorphic Encryption (FHE) enables offloading computation to untrusted servers with cryptographic privacy. Despite its attractive security, FHE is not yet widely adopted due to its prohibitive overheads, about 10,000X over unencrypted computation. Recent FHE accelerators have made strides to bridge this performance gap. Unfortunately, prior accelerators only work well for simple programs, but become inefficient for complex programs, which bring additional costs and challenges. We present CraterLake, the first FHE accelerator that enables FHE programs of unbounded size (i.e., unbounded multiplicative depth). Such computations require very large ciphertexts (tens of MBs each) and different algorithms that prior work does not support well. To tackle this challenge, CraterLake introduces a new hardware architecture that efficiently scales to very large ciphertexts, novel functional units to accelerate key kernels, and new algorithms and compiler techniques to reduce data movement. We evaluate CraterLake on deep FHE programs, including deep neural networks like ResNet and LSTMs, where prior work takes minutes to hours per inference on a CPU. CraterLake outperforms a CPU by gmean 4,600X and the best prior FHE accelerator by 11.2X under similar area and power budgets. These speeds enable real-time performance on unbounded FHE programs for the first time.
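The phrase "unbounded multiplicative depth" is the crux here. A toy noise model shows why: in levelled FHE schemes, each homomorphic multiplication grows ciphertext noise, and once the noise would exceed the budget the ciphertext must be bootstrapped (noise reset, at high cost). The squaring growth rule and the numeric constants below are purely illustrative, not CKKS/BGV parameters.

```python
# Toy FHE noise-budget model: deep circuits force periodic bootstrapping,
# so an accelerator for unbounded programs must make bootstrapping fast.
NOISE_BUDGET = 2 ** 20  # illustrative noise ceiling
FRESH_NOISE = 2         # illustrative noise of a fresh/bootstrapped ciphertext

def run_depth(depth):
    """Count how many bootstraps a multiplication chain of the given
    depth would need under this toy growth model."""
    noise, bootstraps = FRESH_NOISE, 0
    for _ in range(depth):
        if noise * noise > NOISE_BUDGET:  # next multiply would overflow
            noise, bootstraps = FRESH_NOISE, bootstraps + 1
        noise = noise * noise             # multiplication grows noise
    return bootstraps
```

A shallow circuit needs no bootstraps, while a deep one (like a ResNet inference) needs them repeatedly, which is why prior accelerators that neglect bootstrapping-heavy workloads fall behind.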
CraterLake: a hardware accelerator for efficient unbounded computation on encrypted data. DOI: 10.1145/3470496.3527393. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Siying Feng, Xin He, Kuan-Yu Chen, Liu Ke, Xuan Zhang, D. Blaauw, T. Mudge, R. Dreslinski
Near-memory processing has been extensively studied to optimize memory intensive workloads. However, none of the proposed designs address sparse matrix transposition, an important building block in sparse linear algebra applications. Prior work shows that sparse matrix transposition does not scale as well as other sparse primitives such as sparse matrix vector multiplication (SpMV) and hence has become a growing bottleneck in common applications. Sparse matrix transposition is highly memory intensive but low in computational intensity, making it a promising candidate for near-memory processing. In this work, we propose MeNDA, a scalable near-DRAM multi-way merge accelerator that eliminates the off-chip memory interface bottleneck and exposes the high internal memory bandwidth to improve performance and reduce energy consumption for sparse matrix transposition. MeNDA adopts a merge sort based algorithm, exploiting spatial locality, and proposes a near-memory processing unit (PU) featuring a high-performance hardware merge tree. Because of the wide application of merge sort in sparse linear algebra, MeNDA is an extensible solution that can be easily adapted to support other sparse primitives such as SpMV. Techniques including seamless back-to-back merge sort, stall-reducing prefetching, and request coalescing are further explored to take full advantage of the increased system memory bandwidth. Compared to two state-of-the-art implementations of sparse matrix transposition on a CPU and a sparse library on a GPU, MeNDA is able to achieve a speedup of 19.1x, 12.0x, and 7.7x, respectively. MeNDA also shows an efficiency gain of 3.8x over a recent SpMV accelerator integrated with HBM. Incurring a power consumption of only 78.6 mW, a MeNDA PU can be easily accommodated by commodity DIMMs.
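The merge-based view of transposition is easy to see in software: each row of a CSR matrix is already a run of entries sorted by column, so a multi-way merge of those runs emits the entries in column-major (transposed) order. The sketch below uses Python's `heapq.merge` as a stand-in for MeNDA's hardware merge tree; the tuple layout is an illustrative choice.

```python
import heapq

def transpose_merge(rows):
    """rows: list of rows, each a list of (col, val) pairs sorted by col.
    Returns all entries as (col, row, val) tuples in column-major order,
    i.e., the CSC/transposed ordering, via a multi-way merge."""
    runs = [[(col, r, val) for col, val in entries]
            for r, entries in enumerate(rows)]
    return list(heapq.merge(*runs))

csr = [
    [(0, 1.0), (2, 3.0)],   # row 0
    [(1, 4.0)],             # row 1
    [(0, 5.0), (2, 6.0)],   # row 2
]
transposed = transpose_merge(csr)
```

Each run is consumed strictly in order, so the access pattern is sequential per row, which is the spatial locality the abstract says MeNDA exploits.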
MeNDA: a near-memory multi-way merge solution for sparse transposition and dataflows. DOI: 10.1145/3470496.3527432. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Alexandar Devic, S. Rai, A. Sivasubramaniam, Ameen Akel, S. Eilert, Justin Eno
As Processing-In-Memory (PIM) hardware matures and starts making its way into normal compute platforms, software has an important role to play in determining what to perform where, and when, on such heterogeneous systems. Taking an emerging class of PIM hardware which provisions a general purpose (RISC-V) processor at each memory bank, this paper takes on this challenging problem by developing a software compilation framework. This framework analyzes several application characteristics - parallelizability, vectorizability, data set sizes, and offload costs - to determine what, whether, when and how to offload computations to the PIM engines. In the process, it also proposes a vector engine extension to the bank-level RISC-V cores. Using several off-the-shelf C/C++ applications, we demonstrate that PIM is not always a panacea, and a framework such as ours is essential in carefully selecting what needs to be performed where, when and how. The choice of hardware platforms - number of memory banks, relative speeds and capabilities of host CPU and PIM cores, can further impact the "to PIM or not" question.
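The "to PIM or not" decision the paper's framework makes can be caricatured as a roofline-style cost comparison: offload a region only when the estimated PIM time (slower cores, but internal bandwidth and no off-chip traffic) beats host time plus the offload overhead. Every parameter below is an illustrative assumption, not a measurement from the paper.

```python
def should_offload(bytes_moved, flops,
                   host_gflops=100.0,        # assumed host compute rate
                   pim_gflops_total=20.0,    # assumed aggregate PIM compute
                   mem_bw_gb=25.0,           # assumed off-chip DRAM bandwidth
                   pim_internal_bw_gb=200.0, # assumed in-bank bandwidth
                   offload_overhead_s=1e-4): # assumed launch/copy cost
    """Crude roofline estimate: each side is limited by the slower of
    its compute and memory times; offload only if PIM wins overall."""
    host_time = max(flops / (host_gflops * 1e9),
                    bytes_moved / (mem_bw_gb * 1e9))
    pim_time = max(flops / (pim_gflops_total * 1e9),
                   bytes_moved / (pim_internal_bw_gb * 1e9)) + offload_overhead_s
    return pim_time < host_time

# A streaming, memory-bound region favors PIM; a compute-heavy one does not.
stream_wins = should_offload(bytes_moved=1e9, flops=1e8)
compute_stays = should_offload(bytes_moved=1e6, flops=1e11)
```

This also shows why the abstract's closing point holds: changing the bank count or relative core speeds shifts these parameters and can flip the decision for the same code region.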
To PIM or not for emerging general purpose processing in DDR memory systems. DOI: 10.1145/3470496.3527431. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Private Information Retrieval (PIR) plays a vital role in secure, database-centric applications. However, existing PIR protocols explore a massive working space containing hundreds of GiBs of query and database data. As a consequence, PIR performance is severely bounded by storage communication, making it far from practical for real-world deployment. In this work, we describe INSPIRE, an accelerator for IN-Storage Private Information REtrieval. INSPIRE follows a protocol and architecture co-design approach. We first design the INSPIRE protocol with a multi-stage filtering mechanism, which achieves a constant PIR query size. For a 1-billion-entry database of size 288GiB, INSPIRE's protocol reduces the query size from 27GiB to 3.6MiB. Further, we propose the INSPIRE hardware, a heterogeneous in-storage architecture, which integrates our protocol across the SSD hierarchy. Together with the INSPIRE protocol, the INSPIRE hardware reduces the query time from 28.4min to 36s, relative to the state-of-the-art FastPIR scheme.
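For readers new to PIR, the classic two-server XOR scheme (shown below, and emphatically not INSPIRE's protocol) illustrates both the privacy property and the baseline problem: the client's query is a mask as long as the database, which is exactly the query-size blowup that INSPIRE's constant-size queries avoid.

```python
import secrets

def make_queries(n, i):
    """Split index i into two masks: each server sees a uniformly random
    bit vector, so neither learns which record is wanted."""
    q_a = [secrets.randbits(1) for _ in range(n)]
    q_b = q_a.copy()
    q_b[i] ^= 1  # the masks differ only at the wanted index
    return q_a, q_b

def answer(db, q):
    """Each server returns the XOR of the records its mask selects."""
    acc = 0
    for rec, bit in zip(db, q):
        if bit:
            acc ^= rec
    return acc

db = [0x11, 0x22, 0x33, 0x44]
q_a, q_b = make_queries(len(db), 2)
recovered = answer(db, q_a) ^ answer(db, q_b)  # every term except db[2] cancels
```

Note the cost: each query here is one bit per database entry, so a billion-entry database means gigabit-scale queries, which motivates protocol redesigns like INSPIRE's multi-stage filtering.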
INSPIRE: in-storage private information retrieval via protocol and architecture co-design. Jilan Lin, Ling Liang, Zheng Qu, Ishtiyaque Ahmad, L. Liu, Fengbin Tu, Trinabh Gupta, Yufei Ding, Yuan Xie. DOI: 10.1145/3470496.3527433. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-18.
Yuanchao Xu, Chencheng Ye, Yan Solihin, Xipeng Shen
Persistent Memory (PM) is increasingly supplementing or substituting DRAM as main memory. Prior work has focused on reusability and memory leaks of persistent memory but has not addressed a problem amplified by persistence: persistent memory fragmentation, the continuous worsening of fragmentation of persistent memory throughout its usage. This paper reveals the challenges and proposes the first systematic crash-consistent solution, Fence-Free Crash-consistent Concurrent Defragmentation (FFCCD). FFCCD reuses the persistent pointer format, root nodes, and typed allocation provided by the persistent memory programming model to enable concurrent defragmentation on PM. FFCCD introduces architecture support for concurrent defragmentation that enables a fence-free design and a fast read barrier, reducing two major overheads of defragmenting persistent memory. The techniques are effective (28--73% fragmentation reduction) and fast (4.1% execution time overhead).
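A toy compaction pass with a read barrier shows the basic mechanism defragmentation relies on: live objects are copied into a compact region, a forwarding table records the moves, and every access is redirected through that table. FFCCD's contribution is doing this concurrently, crash-consistently, and fence-free with hardware support; this sketch (with invented names and a dict standing in for the heap) shows only the software-visible idea.

```python
# Heap modeled as address -> object; None marks freed slots whose holes
# cause fragmentation. Slot size of 16 bytes is an illustrative choice.
heap = {0: "a", 16: None, 32: "b", 48: None, 64: "c"}
forwarding = {}  # old address -> new address for relocated objects

def defragment():
    """Copy live objects to the lowest free addresses and record each
    relocation in the forwarding table."""
    new_addr = 0
    for addr in sorted(heap):
        if heap[addr] is not None:
            forwarding[addr] = new_addr
            new_addr += 16
    compacted = {forwarding[a]: v for a, v in heap.items() if v is not None}
    heap.clear()
    heap.update(compacted)

def read(addr):
    """Read barrier: follow the forwarding entry if the object moved."""
    return heap[forwarding.get(addr, addr)]

defragment()
```

The read barrier runs on every access, which is why making it fast (and avoiding ordering fences around forwarding updates) dominates the overhead of a real PM defragmenter.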
FFCCD. DOI: 10.1145/3470496.3527406. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-11.
Kevin Loughlin, S. Saroiu, A. Wolman, Yatin A. Manerkar, Baris Kasikci
Prior work shows that Rowhammer attacks---which flip bits in DRAM via frequent activations of the same row(s)---are viable. Adversaries typically mount these attacks via instruction sequences that are carefully-crafted to bypass CPU caches. However, we discover a novel form of hammering that we refer to as coherence-induced hammering, caused by Intel's implementations of cache coherent non-uniform memory access (ccNUMA) protocols. We show that this hammering occurs in commodity benchmarks on a major cloud provider's production hardware, the first hammering found to be generated by non-malicious code. Given DRAM's rising susceptibility to bit flips, it is paramount to prevent coherence-induced hammering to ensure reliability and security in the cloud. Accordingly, we introduce MOESI-prime, a ccNUMA coherence protocol that mitigates coherence-induced hammering while retaining Intel's state-of-the-art scalability. MOESI-prime shows that most DRAM reads and writes triggering such hammering are unnecessary. Thus, by encoding additional information in the coherence protocol, MOESI-prime can omit these reads and writes, preventing coherence-induced hammering in non-malicious and malicious workloads. Furthermore, by omitting unnecessary reads and writes, MOESI-prime has negligible effect on average performance (within ±0.61% of MESI and MOESI) and average DRAM power (0.03%-0.22% improvement) across evaluated ccNUMA configurations.
MOESI-prime: preventing coherence-induced hammering in commodity workloads. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-11. DOI: 10.1145/3470496.3527427.
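The mechanism behind the savings can be shown with a toy model: when two NUMA nodes ping-pong a cache line, a baseline directory that touches DRAM on every transfer activates the same row repeatedly (coherence-induced hammering), whereas supplying the line cache-to-cache and omitting the unnecessary DRAM access removes those activations entirely. This is my own illustrative simplification, not Intel's actual protocol or the paper's exact encoding.

```python
# Toy model of coherence-induced hammering. Each iteration, a cache line
# ping-pongs between two ccNUMA nodes. A baseline directory performs a
# DRAM read/write per transfer, activating the line's DRAM row each time;
# a MOESI-prime-style protocol recognizes these accesses as unnecessary
# and omits them, so the row is never activated by the ping-pong.

def simulate(transfers, omit_unneeded_dram_access):
    row_activations = 0
    for _ in range(transfers):
        if not omit_unneeded_dram_access:
            # Baseline: directory writes back / re-fetches via DRAM,
            # activating the same row on every transfer.
            row_activations += 1
        # Otherwise the line is supplied cache-to-cache; DRAM untouched.
    return row_activations

PING_PONGS = 100_000
baseline = simulate(PING_PONGS, omit_unneeded_dram_access=False)
prime = simulate(PING_PONGS, omit_unneeded_dram_access=True)
assert baseline == PING_PONGS and prime == 0
```

Since the omitted accesses were unnecessary to begin with, eliminating them is why the paper reports negligible performance impact alongside the security benefit.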
Nicholas Mosier, Hanna Lachnitt, Hamed Nemati, Caroline Trippel
We propose leakage containment models (LCMs)---novel axiomatic security contracts which support formally reasoning about the security guarantees of programs when they run on particular microarchitectures. Our core contribution is an axiomatic vocabulary for formalizing LCMs, derived from the established axiomatic vocabulary for formalizing processor memory consistency models. Using this vocabulary, we formalize microarchitectural leakage---focusing on leakage through hardware memory systems---so that it can be automatically detected in programs and provide a taxonomy for classifying said leakage by severity. To illustrate the efficacy of LCMs, we first demonstrate that our leakage definition faithfully captures a sampling of (transient and non-transient) microarchitectural attacks from the literature. Second, we develop a static analysis tool based on LCMs which automatically identifies Spectre vulnerabilities in programs and scales to analyze real-world crypto-libraries.
Axiomatic hardware-software contracts for security. Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-06-11. DOI: 10.1145/3470496.3527412.
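The kind of memory-system leakage that LCMs formalize can be sketched with a toy taint checker: a secret value that flows into the *address* of a memory access becomes microarchitecturally observable (e.g., through which cache set it touches), which is the heart of a Spectre-v1 gadget. The checker below over a tiny straight-line IR is my own sketch, not the paper's axiomatic formalism or its static analysis tool.

```python
# Toy detector for secret-dependent memory addresses, the class of
# hardware-memory-system leakage discussed above. Instructions are
# (op, dst, srcs) tuples; a "load" whose address operand is tainted is
# flagged as a cache-visible transmitter. Purely illustrative: LCMs are
# axiomatic (relational, over candidate executions), not a taint pass.

def find_leaks(program, secrets):
    tainted = set(secrets)
    leaks = []
    for i, (op, dst, srcs) in enumerate(program):
        if op == "load" and srcs[0] in tainted:
            # Address depends on a secret: the access's cache footprint
            # transmits information about that secret.
            leaks.append(i)
        if any(s in tainted for s in srcs):
            tainted.add(dst)  # taint propagates through data flow
    return leaks

# Spectre-v1-like gadget: a secret-derived index feeds a second load.
gadget = [
    ("mul",  "a", ["secret"]),  # a <- secret * stride (tainted address)
    ("load", "y", ["a"]),       # y <- mem[a]: transmits secret via cache
]
assert find_leaks(gadget, {"secret"}) == [1]
```

A severity taxonomy like the paper's could then grade each flagged access, e.g. by whether it occurs only on transiently executed paths.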