As recently studied, the serialized competition overhead for entering a critical section, rather than the critical section execution itself, is the dominant factor limiting the performance of multi-threaded shared-variable applications on NoC-based many-cores. We show that the invalidation-acknowledgement delay for cache coherence between the home node storing the critical-section lock and the cores running competing threads is the leading contributor to this competition overhead during lock spinning, which arises in various spin-lock primitives (such as the ticket lock, ABQL, and the MCS lock) and in the spinning phase of the queue spin-lock (QSL) in modern operating systems. To reduce this lock coherence overhead, we propose in-network packet generation (iNPG), which turns passive "normal" NoC routers that only transmit packets into active "big" routers that can also generate packets. Instead of performing all coherence maintenance at the home node, big routers deployed nearer to the competing threads generate packets that perform early invalidation-acknowledgement for failing threads before their requests reach the home node, shortening the protocol round-trip delay and thus significantly reducing competition overhead across locking primitives. We evaluate iNPG in gem5 using PARSEC and SPEC OMP2012 programs with five different locking primitives. Compared to a state-of-the-art technique for accelerating critical section access, experimental results show that iNPG effectively reduces lock coherence overhead, expediting critical section access by 1.35x on average and 2.03x at maximum, and consequently improving program Region-of-Interest (ROI) runtime by 7.8% on average and 14.7% at maximum.
{"title":"iNPG: Accelerating Critical Section Access with In-network Packet Generation for NoC Based Many-Cores","authors":"Y. Yao, Zhonghai Lu","doi":"10.1109/HPCA.2018.00012","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00012","url":null,"abstract":"As recently studied, serialized competition overhead for entering critical section is more dominant than critical section execution itself in limiting performance of multi-threaded shared variable applications on NoC-based many-cores. We illustrate that the invalidation-acknowledgement delay for cache coherency between the home node storing the critical section lock and the cores running competing threads is the leading factor to high competition overhead in lock spinning, which is realized in various spin-lock primitives (such as the ticket lock, ABQL, MCS lock, etc.) and the spinning phase of queue spin-lock (QSL) in advanced operating systems. To reduce such high lock coherence overhead, we propose in-network packet generation (iNPG) to turn passive \"normal\" NoC routers which only transmit packets into active \"big\" ones that can generate packets. Instead of performing all coherence maintenance at the home node, big routers which are deployed nearer to competing threads can generate packets to perform early invalidation-acknowledgement for failing threads before their requests reach the home node, shortening the protocol round-trip delay and thus significantly reducing competition overhead in various locking primitives. We evaluate iNPG in Gem5 using PARSEC and SPEC OMP2012 programs with five different locking primitives. Compared to a state-of-the-art technique accelerating critical section access, experimental results show that iNPG can effectively reduce lock coherence overhead, expediting critical section access by 1.35x on average and 2.03x at maximum and consequently improving the program Region-of-Interest (ROI) runtime by 7.8% on average and 14.7% at maximum.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133143044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Hazelwood, Sarah Bird, D. Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, P. Noordhuis, M. Smelyanskiy, Liang Xiong, Xiaodong Wang
Machine learning sits at the core of many essential products and services at Facebook. This paper describes the hardware and software infrastructure that supports machine learning at global scale. Facebook's machine learning workloads are extremely diverse: services require many different types of models in practice. This diversity has implications at all layers in the system stack. In addition, a sizable fraction of all data stored at Facebook flows through machine learning pipelines, presenting significant challenges in delivering data to high-performance distributed training flows. Computational requirements are also intense, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference. Addressing these and other emerging challenges continues to require diverse efforts that span machine learning algorithms, software, and hardware design.
{"title":"Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective","authors":"K. Hazelwood, Sarah Bird, D. Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, P. Noordhuis, M. Smelyanskiy, Liang Xiong, Xiaodong Wang","doi":"10.1109/HPCA.2018.00059","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00059","url":null,"abstract":"Machine learning sits at the core of many essential products and services at Facebook. This paper describes the hardware and software infrastructure that supports machine learning at global scale. Facebook's machine learning workloads are extremely diverse: services require many different types of models in practice. This diversity has implications at all layers in the system stack. In addition, a sizable fraction of all data stored at Facebook flows through machine learning pipelines, presenting significant challenges in delivering data to high-performance distributed training flows. Computational requirements are also intense, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference. Addressing these and other emerging challenges continues to require diverse efforts that span machine learning algorithms, software, and hardware design.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128799598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Wei-gong Zhang, Jing Wang, Tao Li
Recent years have seen an explosion of data volumes from a myriad of IoT devices, such as various sensors and ubiquitous cameras. The deluge of IoT data creates enormous opportunities for us to explore the physical world, especially with the help of deep learning techniques. Traditionally, the Cloud is the option for deploying deep learning based applications. However, the challenges of Cloud-centric IoT systems are increasing due to significant data movement overhead, escalating energy needs, and privacy issues. Rather than constantly moving a tremendous amount of raw data to the Cloud, it would be beneficial to leverage the emerging powerful IoT devices to perform the inference task. Nevertheless, a statically trained model cannot efficiently handle the dynamic data in real in-situ environments, which leads to low accuracy. Moreover, the big raw IoT data challenges the traditional supervised training method in the Cloud. To tackle the above challenges, we propose In-situ AI, the first Autonomous and Incremental computing framework and architecture for deep learning based IoT applications. We equip deep learning based IoT systems with autonomous IoT data diagnosis (to minimize data movement) and an incremental, unsupervised training method (to tackle the big raw IoT data generated in ever-changing in-situ environments). To provide efficient architectural support for this new computing paradigm, we first characterize the two In-situ AI tasks (i.e., the inference and diagnosis tasks) on two popular IoT devices (i.e., a mobile GPU and an FPGA) and explore the design space and tradeoffs. Based on the characterization results, we propose two working modes for the In-situ AI tasks, namely Single-running and Co-running modes. Moreover, we craft analytical models for these two modes to guide the best configuration selection. We also develop a novel two-level weight-shared In-situ AI architecture to efficiently deploy In-situ AI tasks to IoT nodes. Compared with traditional IoT systems, our In-situ AI can reduce data movement by 28-71%, which further yields 1.4X-3.3X speedup on model update and contributes to 30-70% energy saving.
{"title":"In-Situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems","authors":"Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Wei-gong Zhang, Jing Wang, Tao Li","doi":"10.1109/HPCA.2018.00018","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00018","url":null,"abstract":"Recent years have seen an exploration of data volumes from a myriad of IoT devices, such as various sensors and ubiquitous cameras. The deluge of IoT data creates enormous opportunities for us to explore the physical world, especially with the help of deep learning techniques. Traditionally, the Cloud is the option for deploying deep learning based applications. However, the challenges of Cloud-centric IoT systems are increasing due to significant data movement overhead, escalating energy needs, and privacy issues. Rather than constantly moving a tremendous amount of raw data to the Cloud, it would be beneficial to leverage the emerging powerful IoT devices to perform the inference task. Nevertheless, the statically trained model could not efficiently handle the dynamic data in the real in-situ environments, which leads to low accuracy. Moreover, the big raw IoT data challenges the traditional supervised training method in the Cloud. To tackle the above challenges, we propose In-situ AI, the first Autonomous and Incremental computing framework and architecture for deep learning based IoT applications. We equip deep learning based IoT system with autonomous IoT data diagnosis (minimize data movement), and incremental and unsupervised training method (tackle the big raw IoT data generated in ever-changing in-situ environments). To provide efficient architectural support for this new computing paradigm, we first characterize the two In-situ AI tasks (i.e. inference and diagnosis tasks) on two popular IoT devices (i.e. mobile GPU and FPGA) and explore the design space and tradeoffs. Based on the characterization results, we propose two working modes for the In-situ AI tasks, including Single-running and Co-running modes. Moreover, we craft analytical models for these two modes to guide the best configuration selection. We also develop a novel two-level weight shared In-situ AI architecture to efficiently deploy In-situ tasks to IoT node. Compared with traditional IoT systems, our In-situ AI can reduce data movement by 28-71%, which further yields 1.4X-3.3X speedup on model update and contributes to 30-70% energy saving.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"262 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122073943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud computing has evolved into a promising computing paradigm. However, it remains a challenging task to protect application privacy and, in particular, the memory access patterns, on cloud servers. The Path ORAM protocol achieves high-level privacy protection but requires large memory bandwidth, which introduces severe execution interference. The recently proposed secure memory model greatly reduces the security enhancement overhead but demands the secure integration of cryptographic logic and memory devices, a memory architecture that is yet to prevail in mainstream cloud servers. In this paper, we propose D-ORAM, a novel Path ORAM scheme for achieving high-level privacy protection and low execution interference on cloud servers with untrusted memory. D-ORAM leverages the buffer-on-board (BOB) memory architecture to offload the Path ORAM primitives to a secure engine in the BOB unit, which greatly alleviates the contention for the off-chip memory bus between secure and non-secure applications. D-ORAM upgrades only one secure memory channel and employs Path ORAM tree split to extend the secure application flexibly across multiple channels, in particular, the non-secure channels. D-ORAM optimizes the link utilization to further improve the system performance. Our evaluation shows that D-ORAM effectively protects application privacy on mainstream computing servers with untrusted memory, with an improvement of NS-App performance by 22.5% on average over the Path ORAM baseline.
{"title":"D-ORAM: Path-ORAM Delegation for Low Execution Interference on Cloud Servers with Untrusted Memory","authors":"Rujia Wang, Youtao Zhang, Jun Yang","doi":"10.1109/HPCA.2018.00043","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00043","url":null,"abstract":"Cloud computing has evolved into a promising computing paradigm. However, it remains a challenging task to protect application privacy and, in particular, the memory access patterns, on cloud servers. The Path ORAM protocol achieves high-level privacy protection but requires large memory bandwidth, which introduces severe execution interference. The recently proposed secure memory model greatly reduces the security enhancement overhead but demands the secure integration of cryptographic logic and memory devices, a memory architecture that is yet to prevail in mainstream cloud servers.,,,, In this paper, we propose D-ORAM, a novel Path ORAM scheme for achieving high-level privacy protection and low execution interference on cloud servers with untrusted memory. D-ORAM leverages the buffer-on-board (BOB) memory architecture to offload the Path ORAM primitives to a secure engine in the BOB unit, which greatly alleviates the contention for the off-chip memory bus between secure and non-secure applications. D-ORAM upgrades only one secure memory channel and employs Path ORAM tree split to extend the secure application flexibly across multiple channels, in particular, the non-secure channels. D-ORAM optimizes the link utilization to further improve the system performance. Our evaluation shows that D-ORAM effectively protects application privacy on mainstream computing servers with untrusted memory, with an improvement of NS-App performance by 22.5% on average over the Path ORAM baseline.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132147842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, deep learning based approaches have emerged as indispensable tools to perform big data analytics. Normally, deep learning models are first trained with a supervised method and then deployed to execute various tasks. The supervised method involves extensive human effort to collect and label the large-scale dataset, which becomes impractical in the big data era where raw data is largely unlabeled and uncategorized. Fortunately, adversarial learning, represented by the Generative Adversarial Network (GAN), has enjoyed great success in unsupervised learning. However, the distinct features of GANs, such as their massive computing phases and non-traditional convolutions, challenge existing deep learning accelerator designs. In this work, we propose the first holistic solution for accelerating unsupervised GAN-based deep learning. We overcome the above challenges with an algorithm and architecture co-design approach. First, we optimize the training procedure to reduce on-chip memory consumption. We then propose a novel time-multiplexed design to efficiently map the abundant computing phases to our microarchitecture. Moreover, we design high-efficiency dataflows to achieve high data reuse and to skip the zero-operand multiplications in the non-traditional convolutions. Compared with traditional deep learning accelerators, our proposed design achieves the best performance (4.3X on average) with the same computing resources. Our design also achieves an average of 8.3X speedup over a CPU and 6.2X better energy efficiency than an NVIDIA GPU.
{"title":"Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-Based Deep Learning","authors":"Mingcong Song, Jiaqi Zhang, Huixiang Chen, Tao Li","doi":"10.1109/HPCA.2018.00016","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00016","url":null,"abstract":"Recently, deep learning based approaches have emerged as indispensable tools to perform big data analytics. Normally, deep learning models are first trained with a supervised method and then deployed to execute various tasks. The supervised method involves extensive human efforts to collect and label the large-scale dataset, which becomes impractical in the big data era where raw data is largely un-labeled and uncategorized. Fortunately, the adversarial learning, represented by Generative Adversarial Network (GAN), enjoys a great success on the unsupervised learning. However, the distinct features of GAN, such as massive computing phases and non-traditional convolutions challenge the existing deep learning accelerator designs. In this work, we propose the first holistic solution for accelerating the unsupervised GAN-based Deep Learning. We overcome the above challenges with an algorithm and architecture co-design approach. First, we optimize the training procedure to reduce on-chip memory consumption. We then propose a novel time-multiplexed design to efficiently map the abundant computing phases to our microarchitecture. Moreover, we design high-efficiency dataflows to achieve high data reuse and skip the zero-operand multiplications in the non-traditional convolutions. Compared with traditional deep learning accelerators, our proposed design achieves the best performance (average 4.3X) with the same computing resource. Our design also has an average of 8.3X speedup over CPU and 6.2X energy-efficiency over NVIDIA GPU.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116320354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haonan Wang, Fan Luo, M. Ibrahim, Onur Kayiran, Adwait Jog
Managing the thread-level parallelism (TLP) of GPGPU applications by limiting it to a certain degree is known to be effective in improving overall performance. However, we find that such prior techniques can lead to sub-optimal system throughput and fairness when two or more applications are co-scheduled on the same GPU. This is because they attempt to maximize the performance of individual applications in isolation, ultimately allowing each application to take a disproportionate amount of shared resources. This leads to high contention in the shared cache and memory. To address this problem, we propose new application-aware TLP management techniques for a multi-application execution environment such that all co-scheduled applications can make good and judicious use of all the shared resources. For measuring such use, we propose an application-level utility metric, called effective bandwidth, which accounts for two runtime metrics: attained DRAM bandwidth and cache miss rates. We find that maximizing the total effective bandwidth, and doing so in a balanced fashion across all co-located applications, can significantly improve system throughput and fairness. Instead of exhaustively searching across all the different combinations of TLP configurations that achieve these goals, we find that a significant amount of overhead can be reduced by taking advantage of the trends, which we call patterns, in the way an application's effective bandwidth changes with different TLP combinations. Our proposed pattern-based TLP management mechanisms improve system throughput and fairness by 20% and 2x, respectively, over a baseline where each application executes with the TLP configuration that provides the best performance when it executes alone.
{"title":"Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management","authors":"Haonan Wang, Fan Luo, M. Ibrahim, Onur Kayiran, Adwait Jog","doi":"10.1109/HPCA.2018.00030","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00030","url":null,"abstract":"Managing the thread-level parallelism (TLP) of GPGPU applications by limiting it to a certain degree is known to be effective in improving the overall performance. However, we find that such prior techniques can lead to sub-optimal system throughput and fairness when two or more applications are co-scheduled on the same GPU. It is because they attempt to maximize the performance of individual applications in isolation, ultimately allowing each application to take a disproportionate amount of shared resources. This leads to high contention in shared cache and memory. To address this problem, we propose new application-aware TLP management techniques for a multi-application execution environment such that all co-scheduled applications can make good and judicious use of all the shared resources. For measuring such use, we propose an application-level utility metric, called effective bandwidth, which accounts for two runtime metrics: attained DRAM bandwidth and cache miss rates. We find that maximizing the total effective bandwidth and doing so in a balanced fashion across all co-located applications can significantly improve the system throughput and fairness. Instead of exhaustively searching across all the different combinations of TLP configurations that achieve these goals, we find that a significant amount of overhead can be reduced by taking advantage of the trends, which we call patterns, in the way application's effective bandwidth changes with different TLP combinations. Our proposed pattern-based TLP management mechanisms improve the system throughput and fairness by 20% and 2x, respectively, over a baseline where each application executes with a TLP configuration that provides the best performance when it executes alone.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114547752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anthony Gutierrez, Bradford M. Beckmann, A. Duțu, Joseph Gross, Michael LeBeane, J. Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matthew D. Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain, Timothy G. Rogers
Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated with the instructions, and in some situations, the machine ISA’s intellectual property may not be publicly disclosed. In this paper, we demonstrate the pitfalls of evaluating GPUs using this higher-level abstraction, and make the case that several important microarchitecture interactions are only visible when executing lower-level instructions. Our analysis shows that given identical application source code and GPU microarchitecture models, execution behavior will differ significantly depending on the instruction set abstraction. For example, our analysis shows the dynamic instruction count of the machine ISA is nearly 2× that of the IL on average, but contention for vector registers is reduced by 3× due to the optimized resource utilization. In addition, our analysis highlights the deficiencies of using IL to model instruction fetching, control divergence, and value similarity. Finally, we show that simulating IL instructions adds 33% error as compared to the machine ISA when comparing absolute runtimes to real hardware.
{"title":"Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level","authors":"Anthony Gutierrez, Bradford M. Beckmann, A. Duțu, Joseph Gross, Michael LeBeane, J. Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matthew D. Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain, Timothy G. Rogers","doi":"10.1109/HPCA.2018.00058","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00058","url":null,"abstract":"Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated with the instructions, and in some situations, the machine ISA’s intellectual property may not be publicly disclosed. In this paper, we demonstrate the pitfalls of evaluating GPUs using this higher-level abstraction, and make the case that several important microarchitecture interactions are only visible when executing lower-level instructions. Our analysis shows that given identical application source code and GPU microarchitecture models, execution behavior will differ significantly depending on the instruction set abstraction. For example, our analysis shows the dynamic instruction count of the machine ISA is nearly 2× that of the IL on average, but contention for vector registers is reduced by 3× due to the optimized resource utilization. In addition, our analysis highlights the deficiencies of using IL to model instruction fetching, control divergence, and value similarity. Finally, we show that simulating IL instructions adds 33% error as compared to the machine ISA when comparing absolute runtimes to real hardware.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114842706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohammad Bakhshalipour, P. Lotfi-Kamran, H. Sarbazi-Azad
Big-data server applications frequently encounter data misses and hence lose significant performance potential. One way to reduce the number of data misses or their effect is data prefetching. As data accesses have high temporal correlations, temporal prefetching techniques are promising for them. While state-of-the-art temporal prefetching techniques are effective at reducing the number of data misses, we observe that there is a significant gap between what they offer and the opportunity. This work aims to improve the effectiveness of temporal prefetching techniques. We identify the lookup mechanism of existing temporal prefetchers as responsible for the large gap between what they offer and the opportunity. Existing lookup mechanisms either do not choose the right stream in the history or unnecessarily delay stream selection, and hence miss the opportunity at the beginning of every stream. In this work, we introduce Domino prefetching to address the limitations of existing temporal prefetchers. Domino is a temporal data prefetching technique that logically looks up the history with both the last miss address and the last two miss addresses to find a match for prefetching. We propose a practical design for the Domino prefetcher that employs an Enhanced Index Table indexed by just a single miss address. We show that the Domino prefetcher captures more than 90% of the temporal opportunity. Through detailed evaluation targeting a quad-core processor and a set of server workloads, we show that the Domino prefetcher improves system performance by 16% over a baseline with no data prefetcher and by 6% over the state-of-the-art temporal data prefetcher.
{"title":"Domino Temporal Data Prefetcher","authors":"Mohammad Bakhshalipour, P. Lotfi-Kamran, H. Sarbazi-Azad","doi":"10.1109/HPCA.2018.00021","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00021","url":null,"abstract":"Big-data server applications frequently encounter data misses, and hence, lose significant performance potential. One way to reduce the number of data misses or their effect is data prefetching. As data accesses have high temporal correlations, temporal prefetching techniques are promising for them. While state-of-the-art temporal prefetching techniques are effective at reducing the number of data misses, we observe that there is a significant gap between what they offer and the opportunity. This work aims to improve the effectiveness of temporal prefetching techniques. We identify the lookup mechanism of existing temporal prefetchers responsible for the large gap between what they offer and the opportunity. Existing lookup mechanisms either not choose the right stream in the history, or unnecessarily delay the stream selection, and hence, miss the opportunity at the beginning of every stream. In this work, we introduce Domino prefetching to address the limitations of existing temporal prefetchers. Domino prefetcher is a temporal data prefetching technique that logically looks up the history with both one and two last miss addresses to find a match for prefetching. We propose a practical design for Domino prefetcher that employs an Enhanced Index Table that is indexed by just a single miss address. We show that Domino prefetcher captures more than 90% of the temporal opportunity. Through detailed evaluation targeting a quad-core processor and a set of server workloads, we show that Domino prefetcher improves system performance by 16% over the baseline with no data prefetcher and 6% over the state-of- the-art temporal data prefetcher.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117237981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gururaj Saileshwar, Prashant J. Nair, Prakash Ramrakhyani, Wendy Elsasser, Moinuddin K. Qureshi
Building trusted data-centers requires resilient memories which are protected from both adversarial attacks and errors. Unfortunately, state-of-the-art memory security solutions incur considerable performance overheads due to accesses for security metadata such as Message Authentication Codes (MACs). At the same time, commercial secure memory solutions tend to be designed oblivious to the presence of memory reliability mechanisms (such as ECC-DIMMs) that provide tolerance to memory errors. Fortunately, ECC-DIMMs possess an additional chip for providing error correction codes (ECC) that is accessed in parallel with data, and this chip can be harnessed for security optimizations. If we can re-purpose the ECC chip to store metadata useful for both security and reliability, it can prove beneficial to both. To this end, this paper proposes Synergy, a reliability-security co-design that improves the performance of secure execution while providing strong reliability for systems with 9-chip ECC-DIMMs. Synergy uses the insight that MACs, being capable of detecting data tampering, are also useful for detecting memory errors. Therefore, MACs are best placed inside the ECC chip, where they can be accessed in parallel with each data access. By co-locating MAC and data, Synergy avoids a separate memory access for the MAC and thereby reduces the overall memory traffic of secure memory systems. Furthermore, Synergy tolerates the failure of 1 chip out of 9 by using a parity constructed over the 9 chips (8 data and 1 MAC), which is used to reconstruct the data of the failed chip. For memory-intensive workloads, Synergy provides a speedup of 20% and reduces the system Energy Delay Product by 31% compared to a secure memory baseline with ECC-DIMMs. At the same time, Synergy increases reliability by 185x compared to ECC-DIMMs that provide Single-Error Correction, Double-Error Detection (SECDED) capability. Synergy uses commercial ECC-DIMMs and does not incur any additional hardware overhead or reduction in security.
{"title":"SYNERGY: Rethinking Secure-Memory Design for Error-Correcting Memories","authors":"Gururaj Saileshwar, Prashant J. Nair, Prakash Ramrakhyani, Wendy Elsasser, Moinuddin K. Qureshi","doi":"10.1109/HPCA.2018.00046","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00046","url":null,"abstract":"Building trusted data-centers requires resilient memories which are protected from both adversarial attacks and errors. Unfortunately, the state-of-the-art memory security solutions incur considerable performance overheads due to accesses for security metadata like Message Authentication Codes (MACs). At the same time, commercial secure memory solutions tend to be designed oblivious to the presence of memory reliability mechanisms (such as ECC-DIMMs), that provide tolerance to memory errors. Fortunately, ECC-DIMMs possess an additional chip for providing error correction codes (ECC), that is accessed in parallel with data, which can be harnessed for security optimizations. If we can re-purpose the ECC-chip to store some metadata useful for security and reliability, it can prove beneficial to both. To this end, this paper proposes Synergy, a reliability-security co-design that improves performance of secure execution while providing strong reliability for systems with 9-chip ECC-DIMMs. Synergy uses the insight that MACs being capable of detecting data tampering are also useful for detecting memory errors. Therefore, MACs are best suited for being placed inside the ECC chip, to be accessed in parallel with each data access. By co-locating MAC and Data, Synergy is able to avoid a separate memory access for MAC and thereby reduce the overall memory traffic for secure memory systems. Furthermore, Synergy is able to tolerate 1 chip failure out of 9 chips by using a parity that is constructed over 9 chips (8 Data and 1 MAC), which is used for reconstructing the data of the failed chip. For memory intensive workloads, Synergy provides a speedup of 20% and reduces system Energy Delay Product by 31% compared to a secure memory baseline with ECC-DIMMs. At the same time, Synergy increases reliability by 185x compared to ECC-DIMMs that provide Single-Error Correction, Double-Error Detection (SECDED) capability. Synergy uses commercial ECC-DIMMs and does not incur any additional hardware overheads or reduction of security.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128911348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael McKeown, Alexey Lavrov, Mohammad Shahrad, Paul J. Jackson, Yaosheng Fu, Jonathan Balkind, Tri M. Nguyen, Katie Lim, Yanqi Zhou, D. Wentzlaff
The end of Dennard scaling and the looming power wall have made power and energy primary design goals for modern processors. Further, new applications such as cloud computing and the Internet of Things (IoT) continue to necessitate increased performance and energy efficiency. Manycore processors show potential in addressing some of these issues. However, there is little detailed power and energy data on manycore processors. In this work, we carefully study the detailed power and energy characteristics of Piton, a 25-core modern open source academic processor, including voltage versus frequency scaling, energy per instruction (EPI), memory system energy, network-on-chip (NoC) energy, thermal characteristics, and application performance and power consumption. This is the first detailed power and energy characterization of an open source manycore design implemented in silicon. The open source nature of the processor provides increased value, enabling detailed characterization verified against simulation and the ability to correlate results with the design and register transfer level (RTL) model. Additionally, this enables other researchers to utilize this work to build new power models, devise new research directions, and perform accurate power and energy research using the open source processor. The characterization data reveals a number of interesting insights, including that operand values have a large impact on EPI, that recomputing data can be more energy efficient than loading it from memory, and that on-chip data transmission (NoC) energy is low, along with insights on energy-efficient multithreaded core design. All data collected and the hardware infrastructure used are open source and available for download at http://www.openpiton.org.
{"title":"Power and Energy Characterization of an Open Source 25-Core Manycore Processor","authors":"Michael McKeown, Alexey Lavrov, Mohammad Shahrad, Paul J. Jackson, Yaosheng Fu, Jonathan Balkind, Tri M. Nguyen, Katie Lim, Yanqi Zhou, D. Wentzlaff","doi":"10.1109/HPCA.2018.00070","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00070","url":null,"abstract":"The end of Dennard’s scaling and the looming power wall have made power and energy primary design goals for modern processors. Further, new applications such as cloud computing and Internet of Things (IoT) continue to necessitate increased performance and energy efficiency. Manycore processors show potential in addressing some of these issues. However, there is little detailed power and energy data on manycore processors. In this work, we carefully study detailed power and energy characteristics of Piton, a 25-core modern open source academic processor, including voltage versus frequency scaling, energy per instruction (EPI), memory system energy, network-on-chip (NoC) energy, thermal characteristics, and application performance and power consumption. This is the first detailed power and energy characterization of an open source manycore design implemented in silicon. The open source nature of the processor provides increased value, enabling detailed characterization verified against simulation and the ability to correlate results with the design and register transfer level (RTL) model. Additionally, this enables other researchers to utilize this work to build new power models, devise new research directions, and perform accurate power and energy research using the open source processor. The characterization data reveals a number of interesting insights, including that operand values have a large impact on EPI, recomputing data can be more energy efficient than loading it from memory, on-chip data transmission (NoC) energy is low, and insights on energy efficient multithreaded core design. All data collected and the hardware infrastructure used is open source and available for download at http://www.openpiton.org.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131761996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}