
2018 IEEE International Symposium on High Performance Computer Architecture (HPCA): Latest Publications

Searching for Potential gRNA Off-Target Sites for CRISPR/Cas9 Using Automata Processing Across Different Platforms
Chunkun Bo, V. Dang, Elaheh Sadredini, K. Skadron
The CRISPR/Cas system is a bacterial immune system that protects cells from foreign genetic elements. One version that has attracted special interest is CRISPR/Cas9, because it can be modified to edit genomes at targeted locations. However, the risk of binding and damaging off-target locations limits its power. Identifying all of these potential off-target sites is thus important for users to apply the system effectively to genome editing. This process is computationally expensive, especially when more differences are allowed in gRNA targeting sequences. In this paper, we propose using automata to search for off-target sites while allowing differences between the reference genome and gRNA targeting sequences. We evaluate the automata-based approach on four different platforms, including conventional architectures such as the CPU and the GPU, and spatial architectures such as the FPGA and Micron's Automata Processor (AP). We compare the proposed approach with two off-target search tools (CasOFFinder (GPU) and CasOT (CPU)), and achieve over 83x speedups on the FPGA compared with CasOFFinder and over 600x speedups compared with CasOT. More customized hardware such as the AP can provide additional speedups (1.5x for the kernel execution) compared with the FPGA. We also evaluate the automata-based solution using single-thread HyperScan (a high-performance automata processing library) on the CPU. HyperScan outperforms CasOT by over 29.7x. The automata-based approach on iNFAnt2 (a DFA/NFA engine on the GPU) does not consistently outperform CasOFFinder, and shows only a slightly better speedup than single-thread HyperScan on the CPU (4.4x in the best case). These results show that the automata-based approach provides significant algorithmic benefits, and that accelerators such as the FPGA and the AP can provide substantial additional speedups. However, iNFAnt2 does not confer a clear advantage because the proposed method does not map well to the GPU architecture. Furthermore, we propose several methods to further improve performance on spatial architectures, and some potential architectural modifications for future automata processing hardware.
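The matching semantics the automata encode can be illustrated with a small, software-only sketch: report every genome window whose Hamming distance from the gRNA targeting sequence is at most k, which the spatial platforms evaluate for all windows in parallel. The function and toy sequences below are illustrative assumptions, not the authors' FPGA/AP design.

```python
def off_target_sites(genome, grna, max_mismatches):
    """Report every window of `genome` within `max_mismatches`
    substitutions (Hamming distance) of the gRNA targeting sequence."""
    k = len(grna)
    hits = []
    for i in range(len(genome) - k + 1):
        mismatches = 0
        for a, b in zip(genome[i:i + k], grna):
            mismatches += a != b
            if mismatches > max_mismatches:
                break          # window rejected early
        else:
            hits.append((i, genome[i:i + k], mismatches))
    return hits

# Toy query: an 8-nt target with up to 2 mismatches (real gRNA targets
# are ~20 nt plus a PAM sequence; both strings here are invented).
for pos, site, d in off_target_sites("ACGTACGTTTGACGTACGAACG", "ACGTACGG", 2):
    print(pos, site, d)
```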
{"title":"Searching for Potential gRNA Off-Target Sites for CRISPR/Cas9 Using Automata Processing Across Different Platforms","authors":"Chunkun Bo, V. Dang, Elaheh Sadredini, K. Skadron","doi":"10.1109/HPCA.2018.00068","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00068","url":null,"abstract":"The CRISPR/Cas system is a bacteria immune system protecting cells from foreign genetic elements. One version that attracted special interest is CRISPR/Cas9, because it can be modified to edit genomes at targeted locations. However, the risk of binding and damaging off-target locations limits its power. Identifying all these potential off-target sites is thus important for users to effectively use the system to edit genomes. This process is computationally expensive, especially when one allows more differences in gRNA targeting sequences. In this paper, we propose using automata to search for off-target sites while allowing differences between the reference genome and gRNA targeting sequences. We evaluate the automata-based approach on four different platforms, including conventional architectures such as the CPU and the GPU, and spatial architectures such as the FPGA and Micron's Automata Processor. We compare the proposed approach with two off-target search tools (CasOFFinder (GPU) and CasOT (CPU)), and achieve over 83x speedups on the FPGA compared with CasOFFinder and over 600x speedups compared with CasOT. More customized hardware such as the AP can provide additional speedups (1.5x for the kernel execution) compared with the FPGA. We also evaluate the automata-based solution using single-thread HyperScan (a high-performance automata processing library) on the CPU. HyperScan outperforms CasOT by over 29.7x. The automata-based approach on iNFAnt2 (a DFA/NFA engine on the GPU) does not consistently work better than CasOFFinder, and only show a slightly better speedup compared with single-thread HyperScan on the CPU (4.4x for the best case). These results show that the automata-based approach provides significant algorithmic benefits, and that accelerators such as the FPGA and the AP can provide substantial additional speedups. However, iNFAnt2 does not confer a clear advantage because the proposed method does not map well to the GPU architecture. Furthermore, we propose several methods to further improve the performance on spatial architectures, and some potential architectural modifications for future automata processing hardware.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"15 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124468455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
Reducing Data Transfer Energy by Exploiting Similarity within a Data Transaction
Donghyuk Lee, Mike O'Connor, Niladrish Chatterjee
Modern highly parallel GPU systems require high-bandwidth DRAM I/O interfaces that can consume a significant amount of energy. This energy increases in proportion to the number of 1 values in the data transactions, due to the asymmetric energy consumption of the Pseudo Open Drain (POD) I/O interface in contemporary Graphics DDR SDRAMs. In this work, we describe a technique to save energy by reducing the energy-expensive 1 values in the DRAM interface. We observe that multiple data elements within a single cache line/sector are often similar to one another. We exploit this characteristic to encode each transfer to the DRAM such that there is one reference copy of the data, with the remaining similar data items encoded predominantly as 0 values. Our proposed low-energy data transfer mechanism, Base+XOR Transfer, encodes the data-similar portion by performing XOR operations between data elements within a single DRAM transaction. We address two challenges that influence the efficiency of our mechanism: i) the frequent appearance of zero data elements in transactions, and ii) the diversity in the underlying size of data types within a transaction. We describe two techniques, Zero Data Remapping and Universal Base+XOR Transfer, to efficiently address these issues. Our proposed encoding scheme requires no additional metadata or changes to existing DRAM devices. We evaluate our mechanism on a modern high-performance GPU system with a variety of graphics and compute workloads. We show that our mechanism reduces energy-expensive 1 values by 35.3% with minimal overheads, and that combining our mechanism with Dynamic Bus Inversion (DBI) reduces 1 values by 48.2% on average. These reductions in 1 values lead to 5.8% and 7.1% DRAM energy savings, respectively.
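A minimal sketch of the Base+XOR idea follows (an illustration of the encoding described above, not the exact hardware encoder; the transaction contents are invented): the first word travels verbatim as the reference, every later word is XORed against it, and similar words therefore travel as mostly-zero values, which are cheap on a POD interface where 1 bits cost energy.

```python
def base_xor_encode(words):
    """Send the first word verbatim as the base; XOR the rest against it
    so that similar words become mostly-zero values on the wire."""
    base = words[0]
    return [base] + [w ^ base for w in words[1:]]

def ones(words, width=32):
    """Count energy-expensive 1 bits across a transaction."""
    mask = (1 << width) - 1
    return sum(bin(w & mask).count("1") for w in words)

# Toy 32-byte transaction of eight similar 32-bit elements.
tx = [0x00FF8810, 0x00FF8811, 0x00FF8815, 0x00FF8830,
      0x00FF8810, 0x00FF8812, 0x00FF8814, 0x00FF8818]
enc = base_xor_encode(tx)
print(ones(tx), "ones raw ->", ones(enc), "ones encoded")
```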
{"title":"Reducing Data Transfer Energy by Exploiting Similarity within a Data Transaction","authors":"Donghyuk Lee, Mike O'Connor, Niladrish Chatterjee","doi":"10.1109/HPCA.2018.00014","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00014","url":null,"abstract":"Modern highly parallel GPU systems require highbandwidth DRAM I/O interfaces that can consume a significant amount of energy. This energy increases in proportion to the number of 1 values in the data transactions due to the asymmetric energy consumption of Pseudo Open Drain (POD) I/O interface in contemporary Graphics DDR SDRAMs. In this work, we describe a technique to save energy by reducing the energy-expensive 1 values in the DRAM interface. We observe that multiple data elements within a single cache line/sector are often similar to one another. We exploit this characteristic to encode each transfer to the DRAM such that there is one reference copy of the data, with remaining similar data items being encoded predominantly as 0 values. Our proposed low energy data transfer mechanism, Base+XOR Transfer, encodes the data-similar portion by performing XOR operations between data elements within a single DRAM transaction. We address two challenges that influence the efficiency of our mechanism, i) the frequent appearance of zero data elements in transactions, and ii) the diversity in the underlying size of data types within a transaction. We describe two techniques, Zero Data Remapping and Universal Base+XOR Transfer, to efficiently address these issues. Our proposed encoding scheme requires no additional metadata or changes to existing DRAM devices. We evaluate our mechanism on a modern high performance GPU system with a variety of graphics and compute workloads. We show that our mechanism reduces energy-expensive 1 values by 35.3% with minimal overheads, and combining our mechanism with Dynamic Bus Inversion (DBI) reduces 1 values by 48.2% on average. These 1 value reductions lead to 5.8% and 7.1% DRAM energy savings, respectively.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116694242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Don’t Correct the Tags in a Cache, Just Check Their Hamming Distance from the Lookup Tag
Alex Gendler, A. Bramnik, Ariel Szapiro, Yiannakis Sazeides
This paper describes the design of an efficient technique for correcting errors in the tag array of set-associative caches. The main idea behind this scheme is that, for a cache tag array protected with an ECC code, the stored tags do not need to be corrected prior to the comparison against a lookup tag for cache hit/miss determination. This eliminates the need for costly hardware to correct the cache tags before checking for a hit or a miss. The paper presents the various optimizations needed to translate this idea into a design that delivers a practical improvement in a product. An analysis of our design, compared to state-of-the-art methods, shows that it can provide the same correction and detection strength with lower area, power, and timing overheads and better performance. An Intel Core® microprocessor implements this technique in its second-level and third-level caches.
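The scheme can be sketched in a few lines: instead of ECC-decoding and correcting the stored tag, encode the lookup tag with the same code and declare a hit when the two codewords lie within the code's correction radius. The Hamming(7,4) toy code below is an assumption for illustration; real tag arrays use wider tags and stronger codes.

```python
def hamming74_encode(tag):
    """Encode a 4-bit tag into a 7-bit single-error-correcting codeword."""
    d = [(tag >> i) & 1 for i in range(4)]
    p1, p2, p3 = d[0] ^ d[1] ^ d[3], d[0] ^ d[2] ^ d[3], d[1] ^ d[2] ^ d[3]
    return (p1 << 6) | (p2 << 5) | (d[0] << 4) | (p3 << 3) | \
           (d[1] << 2) | (d[2] << 1) | d[3]

def tag_hit(stored_codeword, lookup_tag, t=1):
    """Hit iff the stored (possibly corrupted) codeword is within the
    correction radius t of the lookup tag's codeword; the stored tag
    itself is never corrected."""
    distance = bin(stored_codeword ^ hamming74_encode(lookup_tag)).count("1")
    return distance <= t

stored = hamming74_encode(0b1011) ^ (1 << 2)  # tag 0b1011 with one flipped bit
print(tag_hit(stored, 0b1011))   # True: same tag despite the error
print(tag_hit(stored, 0b0011))   # False: a different tag stays a miss
```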
{"title":"Don’t Correct the Tags in a Cache, Just Check Their Hamming Distance from the Lookup Tag","authors":"Alex Gendler, A. Bramnik, Ariel Szapiro, Yiannakis Sazeides","doi":"10.1109/HPCA.2018.00055","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00055","url":null,"abstract":"This paper describes the design of an efficient technique for correcting errors in the tag array of set-associative caches. The main idea behind this scheme is that for a cache tag array protected with ECC code, the stored tags do not need to be corrected prior to the comparison against a lookup tag for cache hit/miss definition. This eliminates the need for costly hardware to correct the cache tags before checking for a hit or a miss. The paper reveals the various optimizations needed to translate this idea into a design that delivers a practical improvement in a product. An analysis of our design, as compared to state of the art methods, shows that it can provide the same correction and detection strength with less area, power and timing overheads and better performance. An Intel Core® microprocessor is implementing this technique in its second level and third level caches.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127492498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices
Jeremie S. Kim, Minesh Patel, Hasan Hassan, O. Mutlu
Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8 s on average for a 64 KiB memory segment below 55°C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., in 88.2 ms) irrespective of operating temperature, using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64 KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70°C and 1426x (868x, 1783x) at 55°C when compared to a DRAM retention PUF, and achieves greater speedups at even lower temperatures.
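A simulation-only sketch of how such a fingerprint could be matched follows; everything below (failure rates, segment size, thresholds) is assumed for illustration, since real evaluation needs memory-controller control over DRAM timing parameters. Latency-weak cells fail mostly repeatably, so the set of failing cells acts as a device fingerprint comparable by set similarity.

```python
import random

SEGMENT_BITS = 64 * 1024 * 8   # one 64 KiB segment

def reduced_latency_read(weak_cells, miss_rate=0.05, seed=None):
    """Stand-in for a reduced-latency read: each latency-weak cell fails
    (i.e., appears in the error pattern) with high probability."""
    rng = random.Random(seed)
    return {c for c in weak_cells if rng.random() > miss_rate}

def same_device(observed, enrolled, threshold=0.8):
    """Authenticate by Jaccard similarity of the two error-cell sets."""
    return len(observed & enrolled) / (len(observed | enrolled) or 1) >= threshold

device_a = set(random.Random(1).sample(range(SEGMENT_BITS), 2000))
device_b = set(random.Random(9).sample(range(SEGMENT_BITS), 2000))
enrolled = reduced_latency_read(device_a, seed=2)
print(same_device(reduced_latency_read(device_a, seed=3), enrolled))  # True
print(same_device(reduced_latency_read(device_b, seed=4), enrolled))  # False
```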
{"title":"The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices","authors":"Jeremie S. Kim, Minesh Patel, Hasan Hassan, O. Mutlu","doi":"10.1109/HPCA.2018.00026","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00026","url":null,"abstract":"Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specications using software-only system calls. Doing so results in error patterns that reect the compound eects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense ampli- ers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70C and 1426x (868x, 1783x) at 55C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123714010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 115
Comprehensive VM Protection Against Untrusted Hypervisor Through Retrofitted AMD Memory Encryption
Yuming Wu, Yutao Liu, Ruifeng Liu, Haibo Chen, B. Zang, Haibing Guan
The confidentiality of tenants' data is at high risk from hardware attacks and privileged malicious software. Hardware-based memory encryption is one of the promising means to provide strong guarantees of data security. Recently AMD has proposed its new memory encryption hardware, SME and SEV, which can selectively encrypt memory regions in a fine-grained manner, e.g., by setting the C-bits in the page table entries. More importantly, SEV further supports encrypted virtual machines. This, intuitively, provides a new opportunity to protect data confidentiality in guest VMs against an untrusted hypervisor in the cloud environment. In this paper, we first provide a security analysis of SEV and uncover a set of security issues with using SEV as a means to defend against an untrusted hypervisor. Based on this study, we then propose a software-based extension to the SEV feature, namely Fidelius, to address those issues while retaining performance efficiency. Fidelius separates the management of critical resources from service provisioning and revokes the untrusted hypervisor's permissions to access specific resources. By adopting a sibling-based protection mechanism with non-bypassable memory isolation, Fidelius embraces both security and efficiency, as it introduces no new layer of abstraction. Meanwhile, Fidelius reuses the SEV API to provide full VM life-cycle protection, including two sets of para-virtualized I/O interfaces to encode the I/O data, which is not considered in the SEV hardware design. A detailed and quantitative security analysis shows its effectiveness in protecting tenants' data from a variety of attack surfaces, and the performance evaluation confirms the performance efficiency of Fidelius.
{"title":"Comprehensive VM Protection Against Untrusted Hypervisor Through Retrofitted AMD Memory Encryption","authors":"Yuming Wu, Yutao Liu, Ruifeng Liu, Haibo Chen, B. Zang, Haibing Guan","doi":"10.1109/HPCA.2018.00045","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00045","url":null,"abstract":"The confidentiality of tenant’s data is confronted with high risk when facing hardware attacks and privileged malicious software. Hardware-based memory encryption is one of the promising means to provide strong guarantees of data security. Recently AMD has proposed its new memory encryption hardware called SME and SEV, which can selectively encrypt memory regions in a fine-grained manner, e.g., by setting the C-bits in the page table entries. More importantly, SEV further supports encrypted virtual machines. This, intuitively, has provided a new opportunity to protect data confidentiality in guest VMs against an untrusted hypervisor in the cloud environment. In this paper, we first provide a security analysis on the (in)security of SEV and uncover a set of security issues of using SEV as a means to defend against an untrusted hypervisor. Based on the study, we then propose a software-based extension to the SEV feature, namely Fidelius, to address those issues while retaining performance efficiency. Fidelius separates the management of critical resources from service provisioning and revokes the permissions of accessing specific resources from the un-trusted hypervisor. By adopting a sibling-based protection mechanism with non-bypassable memory isolation, Fidelius embraces both security and efficiency, as it introduces no new layer of abstraction. Meanwhile, Fidelius reuses the SEV API to provide a full VM life-cycle protection, including two sets of para-virtualized I/O interfaces to encode the I/O data, which is not considered in the SEV hardware design. A detailed and quantitative security analysis shows its effectiveness in protecting tenant’s data from a variety of attack surfaces, and the performance evaluation confirms the performance efficiency of Fidelius.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123784759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Are Coherence Protocol States Vulnerable to Information Leakage?
Fan Yao, M. Doroslovački, Guru Venkataramani
Most commercial multi-core processors incorporate hardware coherence protocols to support efficient data transfers and updates between their constituent cores. While hardware coherence protocols provide immense benefits for application performance by removing the burden of software-based coherence, we note that understanding the security vulnerabilities posed by such oft-used, widely-adopted processor features is critical for secure processor designs in the future. In this paper, we demonstrate a new vulnerability exposed by cache coherence protocol states. We present novel insights into how adversaries could cleverly manipulate the coherence states on shared cache blocks, and construct covert timing channels to illegitimately communicate secrets to the spy. We demonstrate six practical scenarios for covert timing channel construction. In contrast to prior works, we assume a broader adversary model where the trojan and spy can either exploit explicitly shared read-only physical pages (e.g., shared library code), or use the memory deduplication feature to implicitly force the creation of shared physical pages. We demonstrate how adversaries can manipulate combinations of coherence states and data placement in different caches to construct timing channels. We also explore how adversaries could exploit multiple caches and their associated coherence states to improve transmission bandwidth with symbols encoding multiple bits. Our experimental results on commercial systems show that the peak transmission bandwidths of these covert timing channels can vary between 700 and 1100 Kbits/sec. To the best of our knowledge, our study is the first to highlight the vulnerability of hardware cache coherence protocols to timing channels, which can help computer architects craft effective defenses against exploits on such critical processor features.
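The receiver side of such a channel reduces to latency thresholding, sketched below with invented numbers: the trojan leaves the shared line in a coherence state that makes the spy's next probe either fast or slow, and the spy decodes one bit per probe interval. A real spy would time an actual load (e.g., with a cycle counter); the latencies and threshold here are hypothetical.

```python
def decode_bits(probe_latencies_cycles, threshold=100):
    """Spy-side decoding: a slow probe (line must be fetched from the
    trojan's cache) encodes 1, a fast probe (line already readable
    locally) encodes 0."""
    return [1 if lat > threshold else 0 for lat in probe_latencies_cycles]

# Hypothetical probe timings for the transmitted message 1011.
print(decode_bits([180, 45, 160, 170]))   # -> [1, 0, 1, 1]
```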
{"title":"Are Coherence Protocol States Vulnerable to Information Leakage?","authors":"Fan Yao, M. Doroslovački, Guru Venkataramani","doi":"10.1109/HPCA.2018.00024","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00024","url":null,"abstract":"Most commercial multi-core processors incorporate hardware coherence protocols to support efficient data transfers and updates between their constituent cores. While hardware coherence protocols provide immense benefits for application performance by removing the burden of software-based coherence, we note that understanding the security vulnerabilities posed by such oft-used, widely-adopted processor features is critical for secure processor designs in the future. In this paper, we demonstrate a new vulnerability exposed by cache coherence protocol states. We present novel insights into how adversaries could cleverly manipulate the coherence states on shared cache blocks, and construct covert timing channels to illegitimately communicate secrets to the spy. We demonstrate 6 different practical scenarios for covert timing channel construction. In contrast to prior works, we assume a broader adversary model where the trojan and spy can either exploit explicitly shared read-only physical pages (e.g., shared library code), or use memory deduplication feature to implicitly force create shared physical pages. We demonstrate how adversaries can manipulate combinations of coherence states and data placement in different caches to construct timing channels. We also explore how adversaries could exploit multiple caches and their associated coherence states to improve transmission bandwidth with symbols encoding multiple bits. Our experimental results on commercial systems show that the peak transmission bandwidths of these covert timing channels can vary between 700 to 1100 Kbits/sec. To the best of our knowledge, our study is the first to highlight the vulnerability of hardware cache coherence protocols to timing channels that can help computer architects to craft effective defenses against exploits on such critical processor features.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122508936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 92
Memory Hierarchy for Web Search
Grant Ayers, Jung Ho Ahn, C. Kozyrakis, Parthasarathy Ranganathan
Online data-intensive services, such as search, serve billions of users, utilize millions of cores, and comprise a significant and growing portion of datacenter-scale workloads. However, the complexity of these workloads and their proprietary nature have precluded detailed architectural evaluations and optimizations of processor design trade-offs. We present the first detailed study of the memory hierarchy for the largest commercial search engine today. We use a combination of measurements from longitudinal studies across tens of thousands of deployed servers, systematic microarchitectural evaluation on individual platforms, validated trace-driven simulation, and performance modeling, all driven by production workloads servicing real-world user requests. Our data quantifies significant differences between production search and benchmarks commonly used in the architecture community. We identify the memory hierarchy as an important opportunity for performance optimization, and present new insights pertaining to how search stresses the cache hierarchy, both for instructions and data. We show that, contrary to conventional wisdom, there is significant reuse of data that is not captured by current cache hierarchies, and discuss why this precludes state-of-the-art tiled and scale-out architectures. Based on these insights, we propose a new cache hierarchy optimized for search that trades off the inefficient use of L3 cache transistors for higher-performance cores, and adds a latency-optimized on-package eDRAM L4 cache. Compared to state-of-the-art processors, our proposed design performs 27% to 38% better.
{"title":"Memory Hierarchy for Web Search","authors":"Grant Ayers, Jung Ho Ahn, C. Kozyrakis, Parthasarathy Ranganathan","doi":"10.1109/HPCA.2018.00061","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00061","url":null,"abstract":"Online data-intensive services, such as search, serve billions of users, utilize millions of cores, and comprise a significant and growing portion of datacenter-scale workloads. However, the complexity of these workloads and their proprietary nature has precluded detailed architectural evaluations and optimizations of processor design trade-offs. We present the first detailed study of the memory hierarchy for the largest commercial search engine today. We use a combination of measurements from longitudinal studies across tens of thousands of deployed servers, systematic microarchitectural evaluation on individual platforms, validated trace-driven simulation, and performance modeling – all driven by production workloads servicing real-world user requests. Our data quantifies significant differences between production search and benchmarks commonly used in the architecture community. We identify the memory hierarchy as an important opportunity for performance optimization, and present new insights pertaining to how search stresses the cache hierarchy, both for instructions and data. We show that, contrary to conventional wisdom, there is significant reuse of data that is not captured by current cache hierarchies, and discuss why this precludes state-of-the-art tiled and scale-out architectures. Based on these insights, we rethink a new cache hierarchy optimized for search that trades off the inefficient use of L3 cache transistors for higher-performance cores, and adds a latency-optimized on-package eDRAM L4 cache. Compared to state-of-the-art processors, our proposed design performs 27% to 38% better.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122155462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 69
High-Performance GPU Transactional Memory via Eager Conflict Detection
X. Ren, Mieszko Lis
GPU transactional memory (TM) proposals to date have relied on lazy, value-based conflict detection, assuming that GPUs can amortize the latency by executing other warps. In practice, however, concurrency must be throttled to a few warps per core to avoid high abort rates, and TM performance has remained far below that of fine-grained locks. We trace this to the latency cost of validating transactions: two round trips across the crossbar are required for most commits and aborts. With limited concurrency, the warp scheduler cannot amortize this, and leaves the core idle most of the time. In this paper, we show that value-based validation does not scale to high thread counts, and that eager conflict detection becomes more efficient as the number of threads grows. We leverage this insight to propose GETM, a GPU TM with eager conflict detection. GETM relies on a novel distributed logical clock scheme to implement eager conflict detection without the need for cache coherence or signature broadcasts. GETM is up to 2.1 times faster than the state-of-the-art prior work WarpTM (gmean 1.2 times), with 3.6 times lower silicon area overheads and 2.2 times lower power overheads.
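The eager-versus-lazy distinction can be shown with a toy software TM; this is a sketch of the general technique only, since GETM detects conflicts in hardware via distributed logical clocks rather than the owner table assumed below. Each transactional access claims per-address ownership immediately, so a conflicting access aborts on the spot instead of after a commit-time validation round trip across the crossbar.

```python
class ConflictAbort(Exception):
    """Raised the moment a conflicting transactional access is detected."""

class EagerTM:
    """Toy eager-conflict-detection TM with a per-address owner table.
    (Writer aborts would also need an undo log, omitted here.)"""
    def __init__(self):
        self.owner = {}    # address -> owning transaction id
        self.memory = {}

    def access(self, txid, addr, value=None):
        holder = self.owner.setdefault(addr, txid)
        if holder != txid:
            raise ConflictAbort(txid, addr)   # eager: no commit-time validation
        if value is not None:
            self.memory[addr] = value
        return self.memory.get(addr)

    def commit(self, txid):
        """Release ownership; no value-validation round trip is needed."""
        self.owner = {a: t for a, t in self.owner.items() if t != txid}

tm = EagerTM()
tm.access(1, 0x10, value=7)        # tx1 takes ownership of 0x10
try:
    tm.access(2, 0x10)             # tx2 conflicts and aborts immediately
except ConflictAbort as e:
    print("aborted eagerly:", e.args)
tm.commit(1)
```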
{"title":"High-Performance GPU Transactional Memory via Eager Conflict Detection","authors":"X. Ren, Mieszko Lis","doi":"10.1109/HPCA.2018.00029","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00029","url":null,"abstract":"GPUs transactional memory (TM) proposals to date have relied on lazy, value-based conflict detection, assuming that GPUs can amortize the latency by executing other warps. In practice, however, concurrency must be throttled to a few warps per core to avoid high abort rates, and TM performance has remained far below that of fine-grained locks. We trace this to the latency cost of validating transactions: two round trips across the crossbar required for most commits and aborts. With limited concurrency, the warp scheduler cannot amortize this, and leaves the core idle most of the time. In this paper, we show that value-based validation does not scale to high thread counts, and eager conflict detection becomes more efficient as the number of threads grows. We leverage this insight to propose GETM, a GPU TM with eager conflict detection. GETM relies on a novel distributed logical clock scheme to implement eager conflict detection without the need for cache coherence or signature broadcasts. GETM is up to 2.1 times faster than the state-of-the art prior work WarpTM (gmean 1.2 times), with 3.6 times lower silicon area overheads and 2.2 times lower power overheads.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133879537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Characterizing and Mitigating Output Reporting Bottlenecks in Spatial Automata Processing Architectures
J. Wadden, K. Angstadt, K. Skadron
Automata processing has seen a resurgence in importance due to its usefulness for pattern matching and pattern mining of "big data." While large-scale automata processing is known to bottleneck von Neumann processors due to unpredictable memory accesses, spatial architectures excel at automata processing. Spatial architectures can implement automata graphs by wiring together automata states in reconfigurable arrays, allowing parallel automata state computation and point-to-point state transitions on-chip. However, spatial automata processing architectures can suffer from output constraints (up to 255x in commercial systems!) due to the physical placement of states, output processing architecture design, I/O resources, and the massively parallel nature of the architecture. To understand this bottleneck, we conduct the first known characterization of the output requirements of a realistic set of automata processing benchmarks. We find that most benchmarks report fairly frequently, but that few states report at any one time. This observation motivates new output compression schemes and reporting architectures. We evaluate the benefit of one purely software automata transformation and show that output reporting costs can be greatly reduced (improving performance by up to 40%) without hardware modification. We then explore bottlenecks in the reporting architecture of a commercial spatial automata processor and propose a new architecture that improves performance by up to 5.1x.
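The characterization result, frequent reporting cycles but few reporting states per cycle, directly motivates a sparse output encoding, sketched below with invented data (this is not the AP's actual report format): emit (cycle, state IDs) pairs for reporting cycles only, rather than a full output vector every cycle.

```python
def compress_reports(report_stream):
    """Sparse report encoding: keep (cycle, reporting-state IDs) pairs
    for cycles that actually report; empty cycles cost nothing."""
    return [(cycle, states)
            for cycle, states in enumerate(report_stream) if states]

# 8 cycles over 6 report states; the dense form would be an 8 x 6 bit vector.
stream = [[], [3], [], [], [1, 4], [], [], [3]]
print(compress_reports(stream))   # [(1, [3]), (4, [1, 4]), (7, [3])]
```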
{"title":"Characterizing and Mitigating Output Reporting Bottlenecks in Spatial Automata Processing Architectures","authors":"J. Wadden, K. Angstadt, K. Skadron","doi":"10.1109/HPCA.2018.00069","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00069","url":null,"abstract":"Automata processing has seen a resurgence in importance due to its usefulness for pattern matching and pattern mining of \"big data.\" While large-scale automata processing is known to bottleneck von Neumann processors due to unpredictable memory accesses, spatial architectures excel at automata processing. Spatial architectures can implement automata graphs by wiring together automata states in reconfigurable arrays, allowing parallel automata state computation, and point-to-point state transitions on-chip. However, spatial automata processing architectures can suffer from output constraints (up to 255x in commercial systems!) due to the physical placement of states, output processing architecture design, I/O resources, and the massively parallel nature of the architecture. To understand this bottleneck, we conduct the first known characterization of output requirements of a realistic set of automata processing benchmarks. We find that most benchmarks report fairly frequently, but that few states report at any one time. This observation motivates new output compression schemes and reporting architectures. We evaluate the benefit of one purely software automata transformation and show that output reporting costs can be greatly reduced (improving performance by up to 40% without hardware modification. We then explore bottlenecks in the reporting architecture of a commercial spatial automata processor and propose a new architecture that improves performance by up to 5.1x.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131009351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
GPGPU Power Modeling for Multi-domain Voltage-Frequency Scaling
J. Guerreiro, A. Ilic, N. Roma, P. Tomás
Dynamic Voltage and Frequency Scaling (DVFS) of Graphics Processing Unit (GPU) components is one of the most promising power management strategies, due to its potential for significant power and energy savings. However, there is still a lack of simple and reliable models for estimating GPU power consumption across different voltage and frequency levels. Accordingly, a novel GPU power estimation model with both core and memory frequency scaling is herein proposed. This model combines information from both the GPU architecture and the executing GPU application, and also takes into account the non-linear changes in the GPU voltage when the core and memory frequencies are scaled. The model parameters are estimated using a collection of 83 microbenchmarks carefully crafted to stress the main GPU components. Based on the hardware performance events gathered during the execution of GPU applications on a single frequency configuration, the proposed model makes it possible to predict the power consumption of the application over a wide range of frequency configurations, as well as to decompose the contribution of different parts of the GPU pipeline to the overall power consumption. Validated on 3 GPU devices from the most recent NVIDIA microarchitectures (Pascal, Maxwell and Kepler), using a collection of 26 standard benchmarks, the proposed model achieves accurate results (7%, 6% and 12% mean absolute error) for the target GPUs (Titan Xp, GTX Titan X and Tesla K40c).
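The shape of such a model can be sketched as a least-squares fit; the code below is a toy with synthetic data and invented coefficients, whereas the real model is trained on the 83 microbenchmarks, covers many more components, and uses measured non-linear voltage/frequency behavior. Each modeled component contributes its utilization scaled by the f*V^2 dynamic-power factor of its clock domain, plus a constant standing in for static power.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(utils, f_core, v_core, f_mem, v_mem):
    """Per-component features: utilization times the f*V^2 factor of the
    component's clock domain (two core-domain counters and one
    memory-domain counter here), plus a constant term for static power."""
    return np.array([utils[0] * f_core * v_core**2,
                     utils[1] * f_core * v_core**2,
                     utils[2] * f_mem * v_mem**2,
                     1.0])

true_w = np.array([40.0, 25.0, 60.0, 30.0])   # invented component weights
X = np.array([features(rng.uniform(0, 1, 3), f, 0.6 + 0.3 * f, m, 1.35)
              for f, m in zip(rng.uniform(0.5, 1.8, 83),    # core clock, GHz
                              rng.uniform(3.0, 5.0, 83))])  # memory clock, GHz
power = X @ true_w + rng.normal(0, 0.5, 83)   # synthetic measured power (W)

w, *_ = np.linalg.lstsq(X, power, rcond=None)  # fit the component weights
print(np.round(w, 1))   # recovers approximately [40, 25, 60, 30]
```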
{"title":"GPGPU Power Modeling for Multi-domain Voltage-Frequency Scaling","authors":"J. Guerreiro, A. Ilic, N. Roma, P. Tomás","doi":"10.1109/HPCA.2018.00072","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00072","url":null,"abstract":"Dynamic Voltage and Frequency Scaling (DVFS) on Graphics Processing Units (GPUs) components is one of the most promising power management strategies, due to its potential for significant power and energy savings. However, there is still a lack of simple and reliable models for the estimation of the GPU power consumption under a set of different voltage and frequency levels. Accordingly, a novel GPU power estimation model with both core and memory frequency scaling is herein proposed. This model combines information from both the GPU architecture and the executing GPU application and also takes into account the non-linear changes in the GPU voltage when the core and memory frequencies are scaled. The model parameters are estimated using a collection of 83 microbenchmarks carefully crafted to stress the main GPU components. Based on the hardware performance events gathered during the execution of GPU applications on a single frequency configuration, the proposed model allows to predict the power consumption of the application over a wide range of frequency configurations, as well as to decompose the contribution of different parts of the GPU pipeline to the overall power consumption. Validated on 3 GPU devices from the most recent NVIDIA microarchitectures (Pascal, Maxwell and Kepler), by using a collection of 26 standard benchmarks, the proposed model is able to achieve accurate results (7%, 6% and 12% mean absolute error) for the target GPUs (Titan Xp, GTX Titan X and Tesla K40c).","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117062012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37