Understanding data storage and ingestion for large-scale deep recommendation model training: industrial product
DOI: 10.1145/3470496.3533044
Mark Zhao, Niket Agarwal, Aarti Basant, B. Gedik, Satadru Pan, Muhammet Mustafa Ozdal, Rakesh Komuravelli, Jerry Y. Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, C. Kozyrakis, P. Pol
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSAs) are used to train increasingly complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominant factor that constrains overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgently needed, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples shared across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways from our production infrastructure characterization, including identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and assimilating lessons learned in optimizing DSI infrastructure.
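The data-stall problem described above comes down to keeping filtering and preprocessing off the accelerator's critical path. The following is a minimal sketch of that producer/consumer structure, using a Python thread plus a bounded queue as a stand-in for a separate preprocessing service; the feature names, sizes, and filtering policy are illustrative assumptions, not the paper's Data PreProcessing Service.

```python
# Sketch: overlap sample preprocessing with "accelerator" compute so the
# trainer never stalls waiting on data. All names and sizes are made up.
import queue
import threading
import time

FEATURES_KEPT = ["user_id", "item_id", "click"]  # hypothetical feature filter

def read_raw_samples(n):
    """Stand-in for reading rows from the training-data warehouse."""
    for i in range(n):
        yield {"user_id": i, "item_id": i % 7, "click": i % 2, "unused_blob": "x" * 64}

def preprocess(sample):
    """Drop unused features -- a stand-in for decode/filter/transform work."""
    return {k: sample[k] for k in FEATURES_KEPT}

def producer(out_q, n):
    for raw in read_raw_samples(n):
        out_q.put(preprocess(raw))   # blocks if the trainer falls behind
    out_q.put(None)                  # end-of-stream marker

def train(in_q):
    steps = 0
    while (sample := in_q.get()) is not None:
        time.sleep(0.001)            # stand-in for a training step on the DSA
        steps += 1
    return steps

if __name__ == "__main__":
    q = queue.Queue(maxsize=128)     # bounded buffer between preprocessing and trainer
    t = threading.Thread(target=producer, args=(q, 1000))
    t.start()
    print("trained on", train(q), "preprocessed samples")
    t.join()
```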
{"title":"Understanding data storage and ingestion for large-scale deep recommendation model training: industrial product","authors":"Mark Zhao, Niket Agarwal, Aarti Basant, B. Gedik, Satadru Pan, Muhammet Mustafa Ozdal, Rakesh Komuravelli, Jerry Y. Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, C. Kozyrakis, P. Pol","doi":"10.1145/3470496.3533044","DOIUrl":"https://doi.org/10.1145/3470496.3533044","url":null,"abstract":"Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly-complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominating factor that constrains the overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgent, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples used across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways based on our production infrastructure characterization. These include identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and assimilating lessons learned in optimizing DSI infrastructure.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121217696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simulating quantum systems is one of the most important potential applications of quantum computers. The high-level circuit defining the simulation needs to be compiled into one that complies with hardware limitations such as the qubit architecture (connectivity) and instruction (gate) set. General-purpose quantum compilers work at the gate level and have little knowledge of the mathematical properties of quantum applications, missing further optimization opportunities. Existing application-specific compilers apply advanced optimizations only in the scheduling procedure and are restricted to the CNOT or CZ gate set. In this work, we develop a compiler, named 2QAN, to optimize quantum circuits for 2-local qubit Hamiltonian simulation problems, a framework that includes the important quantum approximate optimization algorithm (QAOA). In particular, we exploit the flexibility of permuting different operators in the Hamiltonian (regardless of whether they commute) and propose permutation-aware techniques for qubit routing, gate optimization, and scheduling to minimize compilation overhead. 2QAN can target different architectures and different instruction sets. Compilation results on four applications (up to 50 qubits) and three quantum computers (Google Sycamore, IBMQ Montreal, and Rigetti Aspen) show that 2QAN outperforms state-of-the-art general-purpose compilers and application-specific compilers. Specifically, 2QAN can reduce the number of inserted SWAP gates by 11.5X, reduce overhead in hardware gate count by 68.5X, and reduce overhead in circuit depth by 21X. Experimental results on the Montreal device demonstrate that benchmarks compiled by 2QAN achieve the highest fidelity.
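To illustrate the permutation-aware idea, the toy router below greedily schedules whichever 2-local terms are already adjacent under the current qubit mapping and inserts a SWAP only when no term fits. The coupling map, term list, and greedy policy are illustrative assumptions, not 2QAN's actual routing algorithm.

```python
# Toy permutation-aware routing: reordering Hamiltonian terms lets adjacent
# pairs run first, deferring (and often reducing) SWAP insertion.
from collections import deque

def shortest_path(coupling, a, b):
    """BFS path between physical qubits a and b on the coupling graph."""
    prev, frontier = {a: None}, deque([a])
    while frontier:
        u = frontier.popleft()
        if u == b:
            break
        for v in coupling[u]:
            if v not in prev:
                prev[v] = u
                frontier.append(v)
    path = [b]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def route(terms, coupling, layout):
    """terms: list of (logical_q1, logical_q2); layout: logical -> physical."""
    pending, schedule, swaps = list(terms), [], 0
    while pending:
        ready = [t for t in pending if layout[t[1]] in coupling[layout[t[0]]]]
        if ready:                       # permutation freedom: run any adjacent term now
            for t in ready:
                schedule.append(t)
                pending.remove(t)
        else:                           # nothing fits: move one pair a step closer
            q1, q2 = pending[0]
            path = shortest_path(coupling, layout[q1], layout[q2])
            neighbor_phys = path[1]
            other = next(l for l, p in layout.items() if p == neighbor_phys)
            layout[q1], layout[other] = layout[other], layout[q1]
            swaps += 1
    return schedule, swaps

if __name__ == "__main__":
    line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}        # 4-qubit line device
    terms = [(0, 3), (0, 1), (2, 3), (1, 2)]             # QAOA-style ZZ terms
    print(route(terms, line, {0: 0, 1: 1, 2: 2, 3: 3}))  # schedule and SWAP count
```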
{"title":"2QAN: a quantum compiler for 2-local qubit hamiltonian simulation algorithms","authors":"L. Lao, D. Browne","doi":"10.1145/3470496.3527394","DOIUrl":"https://doi.org/10.1145/3470496.3527394","url":null,"abstract":"Simulating quantum systems is one of the most important potential applications of quantum computers. The high-level circuit defining the simulation needs to be compiled into one that complies with hardware limitations such as qubit architecture (connectivity) and instruction (gate) set. General-purpose quantum compilers work at the gate level and have little knowledge of the mathematical properties of quantum applications, missing further optimization opportunities. Existing application-specific compilers only apply advanced optimizations in the scheduling procedure and are restricted to the CNOT or CZ gate set. In this work, we develop a compiler, named 2QAN, to optimize quantum circuits for 2-local qubit Hamiltonian simulation problems, a framework which includes the important quantum approximate optimization algorithm (QAOA). In particular, we exploit the flexibility of permuting different operators in the Hamiltonian (no matter whether they commute) and propose permutation-aware techniques for qubit routing, gate optimization and scheduling to minimize compilation overhead. 2QAN can target different architectures and different instruction sets. Compilation results on four applications (up to 50 qubits) and three quantum computers (namely, Google Sycamore, IBMQ Montreal and Rigetti Aspen) show that 2QAN outperforms state-of-the-art general-purpose compilers and application-specific compilers. Specifically, 2QAN can reduce the number of inserted SWAP gates by 11.5X, reduce overhead in hardware gate count by 68.5X, and reduce overhead in circuit depth by 21X. Experimental results on the Montreal device demonstrate that benchmarks compiled by 2QAN achieve the highest fidelity.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123753829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dheevatsa Mudigere, Y. Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liangchen Luo, J. Yang, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, E. K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yi-An Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishnaveni Dhulipala, Kranthi G. Kishore, Tyler N. Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry Chen, M. Krishnan, A. Nayak, Krishnakumar Nair, Bharath Muthiah, Mahmoud khorashadi, P. Bhattacharya, Petr Lapukhov, M. Naumov, A. Mathews, Lin Qiao, M. Smelyanskiy, Bill Jia, Vijay Rao
Deep learning recommendation models (DLRMs) have been used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators in DLRMs. In addition, Neo enables extremely high-performance and memory-efficient embedding computations using a variety of critical systems optimizations, including hybrid kernel fusion, software-managed caching, and quality-preserving compression. Finally, Neo is paired with ZionEX, a new hardware platform co-designed with Neo's 4D parallelism for optimizing communications for large-scale DLRM training. Our evaluation on 128 GPUs using 16 ZionEX nodes shows that Neo outperforms existing systems by up to 40× for training 12-trillion-parameter DLRM models deployed in production.
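As a rough illustration of how a 4D-parallel sharding decision might look, the sketch below assigns each embedding table a table-wise, row-wise, or column-wise strategy based on its shape and greedily balances the placement across devices; the thresholds, table shapes, and heuristic are assumptions for illustration, not Neo's planner.

```python
# Sketch of a sharding plan combining table-wise, row-wise, and column-wise
# parallelism for embedding tables (dense layers would be data-parallel).
from dataclasses import dataclass

@dataclass
class Table:
    name: str
    rows: int
    cols: int

def shard_plan(tables, num_devices, max_rows=1_000_000, max_cols=256):
    """Return a list of (table name, strategy, target devices) placements."""
    plan, load = [], [0] * num_devices          # per-device element count as a proxy cost
    for t in sorted(tables, key=lambda t: t.rows * t.cols, reverse=True):
        if t.rows > max_rows:
            strategy, shards = "row-wise", num_devices        # split rows across devices
        elif t.cols > max_cols:
            strategy, shards = "column-wise", num_devices     # split the embedding dim
        else:
            strategy, shards = "table-wise", 1                # whole table on one device
        targets = sorted(range(num_devices), key=load.__getitem__)[:shards]
        for d in targets:                                     # greedy load balancing
            load[d] += t.rows * t.cols // shards
        plan.append((t.name, strategy, targets))
    return plan

if __name__ == "__main__":
    tables = [Table("user_history", 50_000_000, 64),
              Table("wide_sparse", 200_000, 512),
              Table("country", 250, 16)]
    for name, strategy, devices in shard_plan(tables, num_devices=4):
        print(f"{name:>12}: {strategy:<11} on devices {devices}")
    # Dense MLP layers would be replicated with data parallelism across all devices.
```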
{"title":"Software-hardware co-design for fast and scalable training of deep learning recommendation models","authors":"Dheevatsa Mudigere, Y. Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liangchen Luo, J. Yang, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, E. K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yi-An Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishnaveni Dhulipala, Kranthi G. Kishore, Tyler N. Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry Chen, M. Krishnan, A. Nayak, Krishnakumar Nair, Bharath Muthiah, Mahmoud khorashadi, P. Bhattacharya, Petr Lapukhov, M. Naumov, A. Mathews, Lin Qiao, M. Smelyanskiy, Bill Jia, Vijay Rao","doi":"10.1145/3470496.3533727","DOIUrl":"https://doi.org/10.1145/3470496.3533727","url":null,"abstract":"Deep learning recommendation models (DLRMs) have been used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators in DLRMs. In addition, Neo enables extremely high-performance and memory-efficient embedding computations using a variety of critical systems optimizations, including hybrid kernel fusion, software-managed caching, and quality-preserving compression. Finally, Neo is paired with ZionEX, a new hardware platform co-designed with Neo's 4D parallelism for optimizing communications for large-scale DLRM training. Our evaluation on 128 GPUs using 16 ZionEX nodes shows that Neo outperforms existing systems by up to 40× for training 12-trillion-parameter DLRM models deployed in production.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122805695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oblivious RAM (ORAM) is a provably secure primitive that prevents access pattern leakage on the memory bus. By randomly remapping data blocks and accessing redundant blocks, ORAM hides access patterns through obfuscation. Byte-addressable non-volatile memory (NVM) is considered a candidate for main memory due to its better scalability, competitive performance, and persistent data storage. While much prior work focuses on improving ORAM's performance on conventional DRAM-based memory systems, as memory technology shifts to NVM, an efficient crash-consistent ORAM is needed for security, correctness, and performance. Directly applying traditional software-based crash consistency support to an ORAM system is not only expensive but also insecure. In this work, we study how to persist the ORAM construction on an NVM-based memory system. To support crash consistency without compromising ORAM security or performance, we propose PS-ORAM, which consists of a novel ORAM controller design and a set of ORAM access protocols that support crash consistency. Compared with a system without crash consistency support, non-recursive and recursive PS-ORAM incur only 4.29% and 3.65% additional performance overhead, respectively. The results show that PS-ORAM not only provides effective crash consistency with minimal performance and hardware overhead but is also friendly to NVM lifetime.
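The sketch below illustrates the generic stage-persist-commit-apply pattern a crash-consistent ORAM controller must enforce so that a crash never exposes a torn position-map or tree update; the ToyNVM class, persist() barrier, and redo-log layout are simulation stand-ins, not the PS-ORAM protocol itself.

```python
# Sketch: a crash-consistent ORAM-style update on simulated NVM using a redo log.
import copy

class ToyNVM:
    def __init__(self):
        self.durable = {}     # contents guaranteed to survive a crash
        self.pending = {}     # writes not yet flushed

    def write(self, key, value):
        self.pending[key] = copy.deepcopy(value)

    def persist(self):        # stand-in for cache-line flush + fence
        self.durable.update(self.pending)
        self.pending.clear()

def oram_update(nvm, block_id, new_leaf, new_data):
    # 1. Stage the remapped block and new position-map entry in a log region.
    nvm.write("log", {"block": block_id, "leaf": new_leaf, "data": new_data})
    nvm.persist()
    # 2. Mark the log valid (a single small write), then persist.
    nvm.write("log_valid", True)
    nvm.persist()
    # 3. Apply to the "real" position map and tree, then retire the log.
    nvm.write(("posmap", block_id), new_leaf)
    nvm.write(("tree", new_leaf), new_data)
    nvm.persist()
    nvm.write("log_valid", False)
    nvm.persist()

def recover(nvm):
    """After a crash, replay a valid log so the ORAM state is consistent."""
    if nvm.durable.get("log_valid"):
        log = nvm.durable["log"]
        nvm.write(("posmap", log["block"]), log["leaf"])
        nvm.write(("tree", log["leaf"]), log["data"])
        nvm.persist()
        nvm.write("log_valid", False)
        nvm.persist()

if __name__ == "__main__":
    nvm = ToyNVM()
    oram_update(nvm, block_id=7, new_leaf=3, new_data=b"secret")
    recover(nvm)   # idempotent: safe whether or not a crash occurred
    print(nvm.durable[("posmap", 7)], nvm.durable[("tree", 3)])
```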
{"title":"PS-ORAM: efficient crash consistency support for oblivious RAM on NVM","authors":"Gang Liu, KenLi Li, Zheng Xiao, Rujia Wang","doi":"10.1145/3470496.3527425","DOIUrl":"https://doi.org/10.1145/3470496.3527425","url":null,"abstract":"Oblivious RAM (ORAM) is a provable secure primitive to prevent access pattern leakage on the memory bus. By randomly remapping the data blocks and accessing redundant blocks, ORAM prevents access pattern leakage through ob-fuscation. Byte-addressable non-volatile memory (NVM) is considered as the candidate for main memory due to its better scalability, competitive performance, and persistent data store. While there is much prior work focusing on improving ORAM's performance on the conventional DRAM-based memory system, when the memory technology shifts to use NVM, ensuring an efficient crash-consistent ORAM is needed for security, correctness, and performance. Directly using traditional software-based crash consistency support for ORAM system is not only expensive but also insecure. In this work, we study how to persist ORAM construction with an NVM-based memory system. To support crash consistency without damaging ORAM system security and compromising the performance, we propose PS-ORAM. PS-ORAM consists of a novel ORAM controller design and a set of ORAM access protocols that support crash consistency. We evaluate PS-ORAM with the system without crash consistency support, non-recursive and recursive PS-ORAM only incurs 4.29% and 3.65% additional performance overhead. The results show that PS-ORAM not only supports effective crash consistency with minimal performance and hardware overhead but also is friendly to NVM lifetime.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131112086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces MGX, a near-zero overhead memory protection scheme for hardware accelerators. MGX minimizes the performance overhead of off-chip memory encryption and integrity verification by exploiting the application-specific properties of the accelerator execution. In particular, accelerators tend to explicitly manage data movement between on-chip and off-chip memories. Therefore, the general memory access pattern of an accelerator can largely be determined for a given application. Exploiting these characteristics, MGX generates version numbers used in memory encryption and integrity verification using on-chip accelerator state rather than storing them in the off-chip memory; it also customizes the granularity of the memory protection to match the granularity used by the accelerator. To demonstrate the efficacy of MGX, we present an in-depth study of MGX for DNN and graph algorithms. Experimental results show that on average, MGX lowers the performance overhead of memory protection from 28% and 33% to 4% and 5% for DNN and graph processing accelerators in a wide range of benchmarks, respectively.
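The following sketch shows the core idea of deriving version numbers from on-chip execution state (here, a layer/tile/iteration triple from a deterministic DNN schedule) rather than storing per-block counters off-chip; the HMAC-based keystream, key, and all function names are illustrative stand-ins, not the paper's hardware design.

```python
# Sketch: counter-mode-style memory encryption whose version number is
# reconstructed from accelerator state instead of off-chip counter storage.
import hmac, hashlib

KEY = b"on-chip secret key"          # hypothetical accelerator-resident key

def version_from_state(layer, tile, iteration):
    """Version number reconstructed from the deterministic tile schedule."""
    return (iteration << 32) | (layer << 16) | tile

def keystream(address, version, length):
    """HMAC-SHA256 as a PRF over (address, version) -- a software stand-in."""
    msg = address.to_bytes(8, "little") + version.to_bytes(8, "little")
    out, counter = b"", 0
    while len(out) < length:
        out += hmac.new(KEY, msg + counter.to_bytes(4, "little"), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def encrypt_tile(data, address, layer, tile, iteration):
    v = version_from_state(layer, tile, iteration)
    ks = keystream(address, v, len(data))
    return bytes(b ^ k for b, k in zip(data, ks))   # decryption is the same XOR

if __name__ == "__main__":
    data = b"activation tile bytes"
    ct = encrypt_tile(data, address=0x1000, layer=3, tile=5, iteration=0)
    pt = encrypt_tile(ct, address=0x1000, layer=3, tile=5, iteration=0)
    assert pt == data and ct != data
    print("tile encrypted/decrypted with a state-derived version number")
```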
{"title":"MGX: near-zero overhead memory protection for data-intensive accelerators","authors":"Weizhe Hua, M. Umar, Zhiru Zhang, G. Suh","doi":"10.1145/3470496.3527418","DOIUrl":"https://doi.org/10.1145/3470496.3527418","url":null,"abstract":"This paper introduces MGX, a near-zero overhead memory protection scheme for hardware accelerators. MGX minimizes the performance overhead of off-chip memory encryption and integrity verification by exploiting the application-specific properties of the accelerator execution. In particular, accelerators tend to explicitly manage data movement between on-chip and off-chip memories. Therefore, the general memory access pattern of an accelerator can largely be determined for a given application. Exploiting these characteristics, MGX generates version numbers used in memory encryption and integrity verification using on-chip accelerator state rather than storing them in the off-chip memory; it also customizes the granularity of the memory protection to match the granularity used by the accelerator. To demonstrate the efficacy of MGX, we present an in-depth study of MGX for DNN and graph algorithms. Experimental results show that on average, MGX lowers the performance overhead of memory protection from 28% and 33% to 4% and 5% for DNN and graph processing accelerators in a wide range of benchmarks, respectively.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129632654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}