2019 IEEE International Symposium on High Performance Computer Architecture (HPCA): Latest Publications

Elastic Instruction Fetching
Arthur Perais, Rami Sheikh, Luke Yen, Michael McIlvaine, R. Clancy
Branch prediction (i.e., the generation of fetch addresses) and instruction cache accesses need not be tightly coupled. When the instruction fetch stage stalls because of an ICache miss or back-pressure, the branch predictor may run ahead and generate future fetch addresses that can be used for different optimizations, such as instruction prefetching and, more importantly, hiding taken-branch fetch bubbles. This approach is used in many commercially available high-performance designs. However, decoupling branch prediction from instruction retrieval has several drawbacks. First, it can increase the pipeline depth, leading to more expensive pipeline flushes. Second, it requires a large Branch Target Buffer (BTB) to store branch targets, allowing the branch predictor to follow taken branches without decoding instruction bytes. Missing in the BTB also causes additional bubbles. In some classes of workloads, those drawbacks may significantly offset the benefits of decoupling. In this paper, we present ELastic Fetching (ELF), a hybrid mechanism that decouples branch prediction from instruction retrieval while minimizing additional bubbles on pipeline flushes and BTB misses. We present two different implementations that trade off complexity for additional performance. The two variants outperform a baseline decoupled fetcher design by up to 3.7% and 5.2%, respectively.
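To make the decoupling concrete, here is a minimal Python sketch of a decoupled front-end built around a fetch target queue (FTQ). The class name, the FTQ_DEPTH parameter, and the predict_next/icache_ready callbacks are illustrative assumptions rather than structures from the paper; the sketch only shows how run-ahead prediction keeps producing addresses while fetch stalls, and why a flush discards that run-ahead work.

```python
# Illustrative sketch of a decoupled front-end (not the paper's design).
from collections import deque

FTQ_DEPTH = 8  # run-ahead window; deeper queues hide more taken-branch bubbles

class DecoupledFrontEnd:
    def __init__(self, predict_next, icache_ready):
        self.predict_next = predict_next  # pc -> predicted next fetch pc
        self.icache_ready = icache_ready  # pc -> True once the line is available
        self.ftq = deque()                # fetch target queue
        self.pred_pc = 0

    def predictor_cycle(self):
        # The predictor is not blocked by ICache stalls: it keeps generating
        # future fetch addresses until the FTQ fills up (back-pressure).
        if len(self.ftq) < FTQ_DEPTH:
            self.ftq.append(self.pred_pc)
            self.pred_pc = self.predict_next(self.pred_pc)

    def fetch_cycle(self):
        # The fetch stage consumes queued addresses; on an ICache miss it
        # stalls here while the predictor above keeps running ahead.
        if self.ftq and self.icache_ready(self.ftq[0]):
            return self.ftq.popleft()
        return None  # stall, empty FTQ, or waiting after a flush

    def flush(self, correct_pc):
        # A misprediction discards all run-ahead work; the deeper the
        # decoupled pipeline, the costlier this flush (the drawback ELF targets).
        self.ftq.clear()
        self.pred_pc = correct_pc
```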
Citations: 6
Keynote Abstracts
{"title":"Keynote Abstracts","authors":"","doi":"10.1109/hpca.2019.00-44","DOIUrl":"https://doi.org/10.1109/hpca.2019.00-44","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"15 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115576431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CIDR: A Cost-Effective In-Line Data Reduction System for Terabit-Per-Second Scale SSD Arrays
M. Ajdari, Pyeongsu Park, Joonsung Kim, Dongup Kwon, Jang-Hyun Kim
An SSD array, a storage system consisting of multiple SSDs per node, has become a design choice for implementing fast primary storage, and modern storage architects now aim to achieve terabit-per-second-scale performance with the next-generation SSD array. To reduce storage cost and improve device endurance, such an SSD array must employ data reduction schemes (i.e., deduplication, compression), which provide high data reduction capability at minimum cost. However, existing data reduction schemes do not scale with the fast-increasing performance of an SSD array, due to the prohibitive amount of CPU resources they consume (e.g., in software-based schemes), a low data reduction ratio (e.g., in SSD-device-wide deduplication), or cost-ineffectiveness in addressing workload changes in datacenters (e.g., in ASIC-based acceleration). In this paper, we propose CIDR, a novel FPGA-based, cost-effective data reduction system for an SSD array that achieves terabit-per-second-scale storage performance. Our key ideas are as follows. First, we decouple data reduction-related computing tasks from the unscalable host CPUs by offloading them to a scalable array of FPGA boards. Second, we employ a centralized, node-wide metadata management scheme to achieve SSD array-wide, high data reduction. Third, our FPGA-based reconfiguration adapts to different workload patterns by dynamically balancing the amount of software and hardware tasks running on CPUs and FPGAs, respectively. For evaluation, we built an example CIDR prototype achieving up to 12.8 GB/s (0.1 Tbps) on one FPGA. CIDR outperforms the baseline by up to 2.47x for a write-only workload and by an expected 3.2x for a mixed read-write workload. We showed CIDR's scalability toward Tbps-scale performance by measuring a two-FPGA CIDR and projecting the performance impact of additional FPGAs. Keywords: deduplication; compression; FPGA; SSD array
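As a rough illustration of the in-line data reduction path (deduplication followed by compression), here is a hedged Python sketch. Fixed-size 4 KB chunking, SHA-256 fingerprints, zlib compression, and the dict-based fingerprint table are stand-ins for the paper's FPGA pipeline and node-wide metadata scheme.

```python
# Illustrative dedup-then-compress pipeline (software stand-in for the FPGA path).
import hashlib
import zlib

CHUNK = 4096  # fixed-size chunking, chosen only for simplicity

def reduce_stream(data: bytes, fingerprints: dict) -> list:
    out = []  # per-chunk record: ("dup", fp) or ("new", fp, compressed bytes)
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        fp = hashlib.sha256(chunk).digest()
        if fp in fingerprints:
            out.append(("dup", fp))                  # dedup hit: store a reference only
        else:
            fingerprints[fp] = True                  # node-wide table in the real system
            out.append(("new", fp, zlib.compress(chunk)))
    return out

# A stream with a repeated block deduplicates to a single stored copy.
table = {}
blocks = reduce_stream(b"A" * CHUNK + b"B" * CHUNK + b"A" * CHUNK, table)
assert [b[0] for b in blocks] == ["new", "new", "dup"]
```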
Citations: 23
NAND-Net: Minimizing Computational Complexity of In-Memory Processing for Binary Neural Networks
Hyeonuk Kim, Jaehyeong Sim, Yeongjae Choi, L. Kim
{"title":"NAND-Net: Minimizing Computational Complexity of In-Memory Processing for Binary Neural Networks","authors":"Hyeonuk Kim, Jaehyeong Sim, Yeongjae Choi, L. Kim","doi":"10.1109/HPCA.2019.00017","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00017","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127354861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Adaptive Voltage/Frequency Scaling and Core Allocation for Balanced Energy and Performance on Multicore CPUs
G. Papadimitriou, Athanasios Chatzidimitriou, D. Gizopoulos
Energy efficiency is a known major concern for computing system designers. Significant effort is devoted to the power optimization of modern systems, especially in large-scale installations such as data centers, in which both high performance and energy efficiency are important. Power optimization can be achieved through different approaches, several of which focus on adaptive voltage regulation. In this paper, we present a comprehensive exploration of how two server-grade systems behave in different frequency and core-allocation configurations beyond nominal voltage operation. Our analysis, which is built on top of two state-of-the-art ARMv8 microprocessor chips (Applied Micro's X-Gene 2 and X-Gene 3), aims (1) to identify the best performance-per-watt operating points when the servers operate in various voltage/frequency combinations, (2) to reveal how and why the different core-allocation options on the available cores of the microprocessor affect energy consumption, and (3) to enhance the default Linux scheduler to make task-allocation decisions for balanced performance and energy efficiency. Our findings, on actual server hardware, have been integrated into a lightweight online monitoring daemon which decides the optimal combination of voltage, core allocation, and clock frequency to achieve higher energy efficiency. Compared to the default system configuration, our approach reduces energy on average by 25.2% on X-Gene 2 and 22.3% on X-Gene 3, with a minimal performance penalty of 3.2% on X-Gene 2 and 2.5% on X-Gene 3. Keywords: energy efficiency; voltage and frequency scaling; power consumption; multicore characterization; micro-servers
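The core decision such a monitoring daemon makes can be sketched in a few lines: among measured operating points, pick the one with the best performance per watt that still meets a performance floor. The function and the sample numbers below are illustrative assumptions, not the paper's daemon.

```python
# Illustrative operating-point selection (made-up measurements, not X-Gene data).

def best_operating_point(points, min_throughput):
    # points: measured (frequency, voltage) configurations with throughput
    # in ops/s and power in watts; keep only those meeting the floor.
    feasible = [p for p in points if p["throughput"] >= min_throughput]
    # Maximize performance per watt among the feasible points.
    return max(feasible, key=lambda p: p["throughput"] / p["power"])

points = [
    {"freq_mhz": 2400, "volt_mv": 980, "throughput": 100.0, "power": 60.0},
    {"freq_mhz": 2400, "volt_mv": 910, "throughput": 100.0, "power": 52.0},  # below nominal voltage
    {"freq_mhz": 1600, "volt_mv": 840, "throughput": 72.0,  "power": 31.0},
]
print(best_operating_point(points, min_throughput=70.0))
```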
Citations: 32
eQASM: An Executable Quantum Instruction Set Architecture
X. Fu, L. Riesebos, M. A. Rol, J. V. Straten, J. V. Someren, N. Khammassi, Imran Ashraf, R. Vermeulen, V. Newsum, K. Loh, J. C. D. Sterke, W. Vlothuizen, R. N. Schouten, C. G. Almudever, Leonardo DiCarlo, K. Bertels
A widely-used quantum programming paradigm comprises both data flow and control flow. Existing quantum hardware cannot adequately support control flow, significantly limiting the range of quantum software executable on the hardware. By analyzing the constraints in the control microarchitecture, we found that existing quantum assembly languages are either too high-level or too restricted to support comprehensive flow control on the hardware. Also, as observed with the quantum microinstruction set QuMIS [1], the quantum instruction set architecture (QISA) design may suffer from limited scalability and flexibility because of microarchitectural constraints. It is an open challenge to design a scalable and flexible QISA which provides a comprehensive abstraction of the quantum hardware. In this paper, we propose an executable QISA, called eQASM, that can be translated from quantum assembly language (QASM), supports comprehensive quantum program flow control, and is executed on a quantum control microarchitecture. With efficient timing specification, single-operation-multiple-qubit execution, and a very-long-instruction-word architecture, eQASM presents better scalability than QuMIS. The definition of eQASM focuses on the assembly level to be expressive. Quantum operations are configured at compile time instead of being defined at QISA design time. We instantiate eQASM into a 32-bit instruction set targeting a seven-qubit superconducting quantum processor. We validate our design by performing several experiments on a two-qubit quantum processor.
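Purely as an illustration of two of the ideas named above, explicit timing specification and very-long-instruction-word issue, here is a toy Python model of a timed bundle schedule. The Bundle class, the operation names, and the cycle arithmetic are invented for exposition; the real 32-bit eQASM encoding is defined in the paper.

```python
# Toy model of timed, VLIW-style quantum instruction bundles (illustration only).
from dataclasses import dataclass

@dataclass
class Bundle:
    wait_cycles: int   # explicit timing: cycles to wait before this bundle issues
    ops: list          # operations issued in parallel (VLIW slots)

program = [
    Bundle(0, [("X90", "q0"), ("X90", "q1")]),        # two single-qubit ops in one bundle
    Bundle(3, [("CZ", "q0", "q1")]),                  # two-qubit gate after a 3-cycle wait
    Bundle(5, [("MEASURE", "q0"), ("MEASURE", "q1")]),
]

def schedule(bundles):
    # Resolve the explicit waits into absolute issue cycles, as a
    # timing-aware control microarchitecture would.
    cycle = 0
    for b in bundles:
        cycle += b.wait_cycles
        print(f"cycle {cycle}: issue {b.ops}")

schedule(program)
```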
Citations: 1
Bingo Spatial Data Prefetcher
Mohammad Bakhshalipour, Mehran Shakerinava, P. Lotfi-Kamran, H. Sarbazi-Azad
Applications extensively use data objects with a regular and fixed layout, which leads to the recurrence of access patterns over memory regions. Spatial data prefetching techniques exploit this phenomenon to prefetch future memory references and hide the long latency of DRAM accesses. While state-of-the-art spatial data prefetchers are effective at reducing the number of data misses, we observe that there is still significant room for improvement. To select an access pattern for prefetching, existing spatial prefetchers associate observed access patterns with either a short event that has a high probability of recurrence or a long event that has a low probability of recurrence. Consequently, the prefetchers either offer low accuracy or lose significant prediction opportunities. We identify that associating the observed spatial patterns with just a single event significantly limits the effectiveness of spatial data prefetchers. In this paper, we make a case for associating the observed spatial patterns with both short and long events to achieve high accuracy without losing prediction opportunities. We propose the Bingo spatial data prefetcher, in which short and long events are used to select the best access pattern for prefetching. We propose a storage-efficient design for Bingo in which just one history table is needed to maintain the association between the access patterns and the long and short events. Through a detailed evaluation of a set of big-data applications, we show that Bingo improves system performance by 60% over a baseline with no data prefetcher and by 11% over the best-performing prior spatial data prefetcher.
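A minimal sketch of the lookup policy described above, assuming PC+Address as the long event and PC+Offset as the short event: record each observed footprint so it can be found by either event, prefer the long event on lookup, and fall back to the short one. The region and block sizes and the single-dict history are simplifications of Bingo's history table.

```python
# Illustrative long/short-event spatial prefetcher (simplified from the abstract).
REGION = 4096   # spatial region, e.g. a 4 KB page
BLOCK = 64      # cache block size

def long_event(pc, addr):
    return ("long", pc, addr)            # high accuracy, recurs rarely

def short_event(pc, addr):
    return ("short", pc, addr % REGION)  # PC + offset: recurs far more often

class BingoLikePrefetcher:
    def __init__(self):
        self.history = {}  # event -> footprint (set of block offsets in the region)

    def train(self, pc, addr, footprint):
        # Associate the observed footprint with both events; the real design
        # keeps one tagged entry per pattern rather than two dict slots.
        self.history[long_event(pc, addr)] = footprint
        self.history[short_event(pc, addr)] = footprint

    def predict(self, pc, addr):
        # Prefer the accurate long event; fall back to the short event
        # so prediction opportunities are not lost.
        fp = self.history.get(long_event(pc, addr))
        if fp is None:
            fp = self.history.get(short_event(pc, addr))
        if fp is None:
            return []
        base = addr - (addr % REGION)
        return [base + off * BLOCK for off in fp]
```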
Citations: 73
μDPM: Dynamic Power Management for the Microsecond Era
C. Chou, L. Bhuyan, Daniel Wong
The complex, distributed nature of data centers has spawned the adoption of distributed, multi-tiered software architectures consisting of many inter-connected microservices. These microservices exhibit extremely short request service times, often less than 250 µs. We show that these "killer microsecond" service times can cause state-of-the-art dynamic power management techniques to break down, due to short idle-period lengths and low-power-state transition overheads. In this paper, we propose µDPM, a dynamic power management scheme for the microsecond era that coordinates request delaying, per-core sleep states, and voltage/frequency scaling. The idea is to postpone the wake-up of a CPU as long as possible and then adjust the frequency so that the tail latency constraint of requests is satisfied just in time. µDPM reduces processor energy consumption by up to 32% and consistently outperforms state-of-the-art techniques by 2x.
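The just-in-time idea can be illustrated with a small slack computation: a request may be delayed only as long as the tail-latency target still covers the sleep-state exit overhead plus the service time at the chosen frequency. The linear service-time model and the numbers below are illustrative assumptions, not the paper's controller.

```python
# Illustrative just-in-time wakeup calculation (assumed linear frequency model).

def latest_wakeup_us(tail_target_us, wake_overhead_us,
                     service_us_at_fmax, freq_ratio):
    # Assume service time scales roughly inversely with frequency.
    service_us = service_us_at_fmax / freq_ratio
    # Slack left after paying the sleep-state exit and the service time
    # is how long the request may sit before the core must wake up.
    slack = tail_target_us - wake_overhead_us - service_us
    return max(0.0, slack)

# A 500 µs tail-latency target, 40 µs sleep-state exit, and 125 µs service
# time at full frequency, run at half frequency:
delay = latest_wakeup_us(500.0, 40.0, 125.0, freq_ratio=0.5)
print(f"can delay wakeup by {delay:.0f} µs and still meet the target")
```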
Citations: 37
Industry Session Program Committee
{"title":"Industry Session Program Committee","authors":"","doi":"10.1109/hpca.2019.00-46","DOIUrl":"https://doi.org/10.1109/hpca.2019.00-46","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123704943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Power Aware Heterogeneous Node Assembly
Bilge Acun, A. Buyuktosunoglu, Eun Kyung Lee, Yoonho Park
To meet ever-increasing computational requirements, supercomputers and data centers are beginning to utilize fat compute nodes with multiple hardware components such as manycore CPUs and accelerators. These components have intrinsic power variations even among same-model components from the same manufacturer. In this paper, we argue that node assembly techniques that consider these intrinsic power variations can achieve better power efficiency, without any performance trade-off, in large-scale supercomputing facilities and data centers. We propose three different node assembly techniques: (1) Sorted Assembly, (2) Balanced Power Assembly, and (3) Application-Aware Assembly. In Sorted Assembly, node components are categorized (or sorted) into groups according to their power efficiency, and components from the same group are assembled into a node. In Balanced Power Assembly, components are assembled to minimize node-to-node power variations. In Application-Aware Assembly, the components most heavily used by the application are selected for the highest power efficiency. We evaluate the effectiveness and cost savings of the three techniques compared to standard random assembly under different node counts and variability scenarios.
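As an example of the first technique, here is a hedged sketch of Sorted Assembly: rank same-model components by measured power under a fixed load and build each node from same-rank parts, so intrinsic manufacturing variation is concentrated rather than mixed. The sample wattages and the two-components-per-node grouping are made up for illustration.

```python
# Illustrative Sorted Assembly: group identically-ranked components into nodes.

def sorted_assembly(component_watts, per_node):
    # component_watts: measured power of identical parts under a fixed load.
    ranked = sorted(component_watts)  # most power-efficient first
    # Slice consecutive ranks into nodes of `per_node` components each.
    return [ranked[i:i + per_node]
            for i in range(0, len(ranked), per_node)]

# Same SKU, realistic part-to-part variation (made-up numbers):
watts = [203.1, 195.4, 210.8, 197.0, 199.9, 208.2]
for node in sorted_assembly(watts, per_node=2):
    print("node:", node)
```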
Citations: 3