Future computing systems will integrate an increasing number of compute elements per processor. Such systems must be designed to scale up efficiently and to provide effective synchronization semantics, fast data movement, and resource management. At the same time, it is paramount to understand application characteristics in order to dimension hardware components and interfaces, while adapting codes to extract more performance through those features without wasting area or power. This talk covers multiple technologies aimed at scaling up the performance of large processors, along with research insights on synchronization, coherence, bandwidth, and resource management developed during a co-design effort with HPC codes for future systems.
{"title":"Scaling up performance of fat nodes for HPC","authors":"Alejandro Rico","doi":"10.1145/3310273.3325137","DOIUrl":"https://doi.org/10.1145/3310273.3325137","url":null,"abstract":"Future computing systems will integrate an increasing number of compute elements in processors. Such systems must be designed to efficiently scale up and to provide effective synchronization semantics, fast data movement and resource management. At the same time, it is paramount to understand application characteristics to dimension hardware components and interfaces, while adapting the codes to better exploit performance through those features without wasting area or power. This talk will cover multiple technologies targeted to scale up performance of large processors and research insights around synchronization, coherence, bandwidth and resource management, developed during the co-design effort with HPC codes for future systems.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116746854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a general approach to predicting spatially fine-grained air quality. The model is based on a deep bidirectional and unidirectional long short-term memory (DBU-LSTM) neural network, which can capture bidirectional temporal dependencies and spatial correlations from time-series data. Heterogeneous urban data such as points of interest (POIs) and road networks are used to evaluate the similarities between urban regions, and tensor decomposition is used to complete the missing historical air-quality data of the monitoring stations. We evaluate our approach on real data collected in Beijing, and the experimental results show its advantages over baseline methods.
{"title":"Spatially fine-grained air quality prediction based on DBU-LSTM","authors":"Liang Ge, Aoli Zhou, Hang Li, Junling Liu","doi":"10.1145/3310273.3322829","DOIUrl":"https://doi.org/10.1145/3310273.3322829","url":null,"abstract":"This paper proposes a general approach to predict the spatially fine-grained air quality. The model is based on deep bidirectional and unidirectional long short-term memory (DBU-LSTM) neural network, which can capture bidirectional temporal dependencies and spatial correlations from time series data. Urban heterogeneous data such as point of interest (POI) and road network are used to evaluate the similarities between urban regions. The tensor decomposition method is used to complete the missing historical air quality data of monitoring stations. We evaluate our approach on real data sources obtained in Beijing, and the experimental results show its advantages over baseline methods.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122787256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce Approximate Unrolling, a compiler loop optimization that reduces execution time and energy consumption by exploiting code regions that can endure some approximation and still produce acceptable results. Specifically, this work focuses on counted loops that map a function over the elements of an array. Approximate Unrolling transforms loops similarly to Loop Unrolling; however, unlike its exact counterpart, our optimization does not unroll loops by adding exact copies of the loop's body. Instead, it adds code that interpolates the results of previous iterations.
{"title":"Approximate loop unrolling","authors":"M. Rodriguez-Cancio, B. Combemale, B. Baudry","doi":"10.1145/3310273.3323841","DOIUrl":"https://doi.org/10.1145/3310273.3323841","url":null,"abstract":"We introduce Approximate Unrolling, a compiler loop optimization that reduces execution time and energy consumption, exploiting code regions that can endure some approximation and still produce acceptable results. Specifically, this work focuses on counted loops that map a function over the elements of an array. Approximate Unrolling transforms loops similarly to Loop Unrolling. However, unlike its exact counterpart, our optimization does not unroll loops by adding exact copies of the loop's body. Instead, it adds code that interpolates the results of previous iterations.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126221848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern processor cores share the last-level cache and directory to improve resource utilization. Unfortunately, such sharing makes the cache vulnerable to cross-core cache side-channel attacks. Recent studies show that information leakage through cross-core cache side-channel attacks is a serious threat in computing domains ranging from cloud servers and mobile phones to embedded devices. However, previous solutions either lose performance, lack a golden standard, require software support, or can be easily bypassed. In this paper, we observe that most cross-core cache side-channel attacks cause sensitive data to appear in a ping-pong pattern in continuous attack scenarios, where attackers must launch numerous attacks in a short period of time. This paper proposes CacheGuard to defend against such continuous attacks. CacheGuard extends the directory architecture to capture ping-pong patterns. Once the ping-pong pattern of a cache line is captured, CacheGuard can secure the line with two pattern-oriented counteractions, Preload and Lock. The experimental evaluation demonstrates that CacheGuard blocks continuous attacks while inducing negligible performance degradation and hardware overhead.
{"title":"CacheGuard: a security-enhanced directory architecture against continuous attacks","authors":"Kai Wang, Fengkai Yuan, Rui Hou, Jingqiang Lin, Z. Ji, Dan Meng","doi":"10.1145/3310273.3323051","DOIUrl":"https://doi.org/10.1145/3310273.3323051","url":null,"abstract":"Modern processor cores share the last-level cache and directory to improve resource utilization. Unfortunately, such sharing makes the cache vulnerable to cross-core cache side channel attacks. Recent studies show that information leakage through cross-core cache side channel attacks is a serious threat in different computing domains ranging from cloud servers and mobile phones to embedded devices. However, previous solutions have limitations of losing performance, lacking golden standards, requiring software support, or being easily bypassed. In this paper, we observe that most cross-core cache side channel attacks cause sensitive data to appear in a ping-pong pattern in continuous attack scenarios, where attackers need to launch numerous attacks in a short period of time. This paper proposes CacheGuard to defend against the continuous attacks. CacheGuard extends the directory architecture for capturing the ping-pong patterns. Once the ping-pong pattern of a cache line is captured, Cache-Guard can secure the line with two pattern-oriented counteractions, Preload and Lock. The experimental evaluation demonstrates that CacheGuard can block the continuous attacks, and that it induces negligible performance degradation and hardware overhead.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131423282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CPU/GPU frequency scheduling on smartphones that maintains users' quality of experience (QoE) while reducing power consumption has been studied extensively. Most previous work focused on power-hungry applications such as video streaming or 3D games. However, the majority of people are light-to-medium users who run interactive applications such as social networking and web browsing. For such applications it is difficult to reduce power consumption, because their behavior depends on the user's interactions and is hard to characterize. In this paper, we tackle this challenging problem by considering the influence of user context on interaction behavior. We propose a context-aware CPU/GPU frequency-scheduling governor that allocates just enough CPU/GPU frequency to meet the workload in each stage of user interaction. Evaluations show that the proposed governor reduces power consumption by up to 25% compared to the default governor while keeping users satisfied with the QoE.
{"title":"User-centered context-aware CPU/GPU power management for interactive applications on smartphones","authors":"Syuan-Yi Lin, C. King","doi":"10.1145/3310273.3322825","DOIUrl":"https://doi.org/10.1145/3310273.3322825","url":null,"abstract":"CPU/GPU frequency scheduling on smartphones that maintains users' quality of experience (QoE) while reducing power consumption has been studied extensively in the past. Most previous works focused on power-hungry applications such as video streaming or 3D games. However, the majority of people are light to medium users, using applications such as social networking, web browsing, etc. For such interactive applications, it is difficult to reduce power consumption, because their behaviors depend on the user's interactions and are hard to characterize. In this paper, we tackle this challenging problem by considering the influences of user contexts on their interaction behaviors. A context-aware CPU/GPU frequency scheduling governor is proposed that allocates CPU/GPU frequencies just enough to meet the workload under different stages of user interaction. Evaluations show that the proposed governor can save power consumption up to 25% compared to the default governor while keeping the users satisfied with the QoE.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130412449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most modern processors perform out-of-order speculative execution to maximise system performance. Spectre and Meltdown exploit these optimisations, executing certain instructions that leak confidential information of the victim. All variants of this class of attacks necessarily exploit branch prediction or speculative execution. Using this insight, we develop a two-step strategy to effectively detect these attacks using performance-counter statistics, a correlation-coefficient model, a deep neural network, and the fast Fourier transform. Our approach is expected to provide reliable, fast, and highly accurate results with no perceivable loss in system performance or system overhead.
{"title":"Performance statistics and learning based detection of exploitative speculative attacks","authors":"Swastika Dutta, S. Sinha","doi":"10.1145/3310273.3322832","DOIUrl":"https://doi.org/10.1145/3310273.3322832","url":null,"abstract":"Most of the modern processors perform out-of-order speculative executions to maximise system performance. Spectre and Meltdown exploit these optimisations and execute certain instructions leading to leakage of confidential information of the victim. All the variants of this class of attacks necessarily exploit branch prediction or speculative execution. Using this insight, we develop a two step strategy to effectively detect these attacks using performance counter statistics, correlation coefficient model, deep neural network and fast Fourier transform. Our approach is expected to provide reliable, fast and highly accurate results with no perceivable loss in system performance or system overhead.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131931903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed graph computing is widely used to process large amounts of data on the internet, and communication overhead is a critical factor in the overall efficiency of graph algorithms. By speculatively predicting the content of communications, we develop an optimization technique that significantly reduces the amount of communication needed for a class of graph algorithms. We evaluated our technique using five graph algorithms, Single-Source Shortest Path, Connected Components, PageRank, Diameter, and Random Walk, on Amazon EC2 clusters with different graph datasets. Our optimized implementations reduce communication overhead by 21--93% for these algorithms while keeping error rates under 5%.
{"title":"Accelerating parallel graph computing with speculation","authors":"Shuo Ji, Yinliang Zhao, Qing Yi","doi":"10.1145/3310273.3323049","DOIUrl":"https://doi.org/10.1145/3310273.3323049","url":null,"abstract":"Nowadays distributed graph computing is widely used to process large amount of data on the internet. Communication overhead is a critical factor in determining the overall efficiency of graph algorithms. Through speculative prediction of the content of communications, we develop an optimization technique to significantly reduce the amount of communications needed for a class of graph algorithms. We have evaluated our optimization technique using five graph algorithms, Single-source shortest path, Connected Components, PageRank, Diameter, and Random Walk, on the Amazon EC2 clusters using different graph datasets. Our optimized implementations have reduced communication overhead by 21--93% for these algorithms, while keeping the error rates under 5%.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131587660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The exploration and analysis of Web graphs has flourished in the recent past, producing a large number of relevant and interesting research results. However, the unique characteristics of the Tor network demand specific algorithms to explore and analyze it. Tor is an anonymity network that allows offering and accessing various Internet resources while guaranteeing a high degree of provider and user anonymity. So far the attention of the research community has focused on assessing the security of the Tor infrastructure: most research on the Tor network has aimed at discovering protocol vulnerabilities to de-anonymize users and services, while little or no information is available about the topology of the Tor Web graph or the relationship between page content and topological structure. With our work we aim to address this lack of information. We describe the topology of the Tor Web graph, measuring both global and local properties by means of well-known metrics that, due to the size of the network, require high-performance algorithms. We consider three different snapshots obtained by extensively crawling Tor three times over a five-month time frame. Finally, we present a correlation analysis of page semantics and topology, discussing novel insights about the organization of the Tor Web and its content. Our findings show that the Tor graph presents some of the characteristics of social and surface-web graphs, along with a few unique peculiarities.
{"title":"Analysing the tor web with high performance graph algorithms","authors":"M. Bernaschi, Alessandro Celestini, Stefano Guarino, F. Lombardi, Enrico Mastrostefano","doi":"10.1145/3310273.3323918","DOIUrl":"https://doi.org/10.1145/3310273.3323918","url":null,"abstract":"The exploration and analysis of Web graphs has flourished in the recent past, producing a large number of relevant and interesting research results. However, the unique characteristics of the Tor network demand for specific algorithms to explore and analyze it. Tor is an anonymity network that allows offering and accessing various Internet resources while guaranteeing a high degree of provider and user anonymity. So far the attention of the research community has focused on assessing the security of the Tor infrastructure. Most research work on the Tor network aimed at discovering protocol vulnerabilities to de-anonymize users and services, while little or no information is available about the topology of the Tor Web graph or the relationship between pages' content and topological structure. With our work we aim at addressing such lack of information. We describe the topology of the Tor Web graph measuring both global and local properties by means of well-known metrics that require due to the size of the network, high performance algorithms. We consider three different snapshots obtained by extensively crawling Tor three times over a 5 months time frame. Finally we present a correlation analysis of pages' semantics and topology, discussing novel insights about the Tor Web organization and its content. Our findings show that the Tor graph presents some of the characteristics of social and surface web graphs, along with a few unique peculiarities.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114424502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Designing scalable concurrent priority queues for contemporary NUMA servers is challenging. Several NUMA-unaware implementations can scale up to a high number of threads by exploiting the potential parallelism of insert operations. In contrast, in deleteMin-dominated workloads, threads compete to access the same memory locations, i.e., the first item in the priority queue. In such cases, NUMA-aware implementations are typically used, since they reduce coherence traffic between the nodes of a NUMA system. In this work, we propose an adaptive priority queue, called SmartPQ, that tunes itself by automatically switching between NUMA-unaware and NUMA-aware algorithmic modes to provide the highest available performance under all workloads. SmartPQ is built on top of NUMA Node Delegation (Nuddle), a low-overhead technique for constructing NUMA-aware data structures using any arbitrary NUMA-unaware implementation as a backbone. Moreover, SmartPQ employs machine learning to decide when to switch between its two algorithmic modes. As our evaluation reveals, SmartPQ achieves the highest available performance with an 88% success rate and dynamically adapts between its NUMA-aware and NUMA-unaware modes without overhead, performing up to 1.83x better than SprayList, the state-of-the-art NUMA-unaware priority queue.
{"title":"An adaptive concurrent priority queue for NUMA architectures","authors":"F. Strati, Christina Giannoula, Dimitrios Siakavaras, G. Goumas, N. Koziris","doi":"10.1145/3310273.3323164","DOIUrl":"https://doi.org/10.1145/3310273.3323164","url":null,"abstract":"Designing scalable concurrent priority queues for contemporary NUMA servers is challenging. Several NUMA-unaware implementations can scale up to a high number of threads exploiting the potential parallelism of the insert operations. In contrast, in deleteMin-dominated workloads, threads compete for accessing the same memory locations, i.e. the first item in the priority queue. In such cases, NUMA-aware implementations are typically used, since they reduce the coherence traffic between the nodes of a NUMA system. In this work, we propose an adaptive priority queue, called SmartPQ, that tunes itself by automatically switching between NUMA-unaware and NUMA-aware algorithmic modes to provide the highest available performance under all workloads. SmartPQ is built on top of NUMA Node Delegation (Nuddle), a low overhead technique to construct NUMA-aware data structures using any arbitrary NUMA-unaware implementation as its backbone. Moreover, SmartPQ employs machine learning to decide when to switch between its two algorithmic modes. As our evaluation reveals, it achieves the highest available performance with 88% success rate and dynamically adapts between a NUMA-aware and a NUMA-unaware mode, without overheads, while performing up to 1.83 times better performance than Spraylist, the state-of-the-art NUMA-unaware priority queue.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128087334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-Volatile Memory (NVM) such as PCM has emerged as a potential alternative for main memory due to its high density and low leakage power. However, an NVM main-memory system faces three challenges compared to Dynamic Random Access Memory (DRAM): long latency, poor write endurance, and data security. To address these three challenges, we propose a secure DRAM+NVM hybrid memory module. The hybrid module integrates a DRAM cache and a security unit (SU). The DRAM cache improves the performance of the NVM memory module and reduces the number of direct writes to the NVM; our results show that a 256MB 2-way DRAM cache with 1024B cache lines performs well in an 8GB NVM main-memory module. The SU is embedded in the onboard controller and includes an AES-GCM engine and an NVM vault. The AES-GCM engine implements encryption and authentication with low overhead, and the NVM vault stores the MAC tags and counter values for each DRAM cache line. According to our results, the proposed secure hybrid memory module improves performance by 32% compared to an NVM-only memory module, and is only 6.8% slower than a DRAM-only memory module.
{"title":"Designing a secure DRAM+NVM hybrid memory module","authors":"Xu Wang, I. Koren","doi":"10.1145/3310273.3323069","DOIUrl":"https://doi.org/10.1145/3310273.3323069","url":null,"abstract":"Non-Volatile Memory (NVM) such as PCM has emerged as a potential alternative for main memory due to its high density and low leakage power. However, an NVM main-memory system faces three challenges when compared to Dynamic Random Access Memory (DRAM) - long latency, poor write endurance and data security. To address these three challenges, we propose a secure DRAM+NVM hybrid memory module. The hybrid module integrates a DRAM cache and a security unit (SU). DRAM cache can improve the performance of an NVM memory module and reduce the number of direct writes to the NVM. Our results show that a 256MB 2-way DRAM cache with a 1024B cache line performs well in an 8GB NVM main memory module. The SU is embedded in the onboard controller and includes an AES-GCM engine and an NVM vault. The AES-GCM engine implements encryption and authentication with low overhead. The NVM vault is used to store MAC tags and counter values for each DRAM cache line. According to our results, the proposed secure hybrid memory module improves the performance by 32% compared to an NVM-only memory module, and is only 6.8% slower than a DRAM only memory module.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129158182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}