As recently studied, the serialized competition overhead for entering a critical section, rather than the critical section execution itself, is the dominant factor limiting the performance of multi-threaded shared-variable applications on NoC-based many-cores. We show that the invalidation-acknowledgement delay for cache coherence between the home node storing the critical-section lock and the cores running competing threads is the leading contributor to this competition overhead during lock spinning, which arises in various spin-lock primitives (such as the ticket lock, ABQL, and the MCS lock) and in the spinning phase of the queue spin-lock (QSL) in modern operating systems. To reduce this lock coherence overhead, we propose in-network packet generation (iNPG), which turns passive "normal" NoC routers that only transmit packets into active "big" routers that can also generate packets. Instead of performing all coherence maintenance at the home node, big routers deployed nearer to the competing threads generate packets that perform early invalidation-acknowledgement for failing threads before their requests reach the home node, shortening the protocol round-trip delay and thus significantly reducing competition overhead across locking primitives. We evaluate iNPG in gem5 using PARSEC and SPEC OMP2012 programs with five different locking primitives. Compared to a state-of-the-art technique for accelerating critical section access, experimental results show that iNPG effectively reduces lock coherence overhead, expediting critical section access by 1.35x on average and 2.03x at maximum, and consequently improving program Region-of-Interest (ROI) runtime by 7.8% on average and 14.7% at maximum.
{"title":"iNPG: Accelerating Critical Section Access with In-network Packet Generation for NoC Based Many-Cores","authors":"Y. Yao, Zhonghai Lu","doi":"10.1109/HPCA.2018.00012","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00012","url":null,"abstract":"As recently studied, serialized competition overhead for entering critical section is more dominant than critical section execution itself in limiting performance of multi-threaded shared variable applications on NoC-based many-cores. We illustrate that the invalidation-acknowledgement delay for cache coherency between the home node storing the critical section lock and the cores running competing threads is the leading factor to high competition overhead in lock spinning, which is realized in various spin-lock primitives (such as the ticket lock, ABQL, MCS lock, etc.) and the spinning phase of queue spin-lock (QSL) in advanced operating systems. To reduce such high lock coherence overhead, we propose in-network packet generation (iNPG) to turn passive \"normal\" NoC routers which only transmit packets into active \"big\" ones that can generate packets. Instead of performing all coherence maintenance at the home node, big routers which are deployed nearer to competing threads can generate packets to perform early invalidation-acknowledgement for failing threads before their requests reach the home node, shortening the protocol round-trip delay and thus significantly reducing competition overhead in various locking primitives. We evaluate iNPG in Gem5 using PARSEC and SPEC OMP2012 programs with five different locking primitives. Compared to a state-of-the-art technique accelerating critical section access, experimental results show that iNPG can effectively reduce lock coherence overhead, expediting critical section access by 1.35x on average and 2.03x at maximum and consequently improving the program Region-of-Interest (ROI) runtime by 7.8% on average and 14.7% at maximum.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133143044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Hazelwood, Sarah Bird, D. Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, P. Noordhuis, M. Smelyanskiy, Liang Xiong, Xiaodong Wang
Machine learning sits at the core of many essential products and services at Facebook. This paper describes the hardware and software infrastructure that supports machine learning at global scale. Facebook's machine learning workloads are extremely diverse: services require many different types of models in practice. This diversity has implications at all layers in the system stack. In addition, a sizable fraction of all data stored at Facebook flows through machine learning pipelines, presenting significant challenges in delivering data to high-performance distributed training flows. Computational requirements are also intense, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference. Addressing these and other emerging challenges continues to require diverse efforts that span machine learning algorithms, software, and hardware design.
{"title":"Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective","authors":"K. Hazelwood, Sarah Bird, D. Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, P. Noordhuis, M. Smelyanskiy, Liang Xiong, Xiaodong Wang","doi":"10.1109/HPCA.2018.00059","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00059","url":null,"abstract":"Machine learning sits at the core of many essential products and services at Facebook. This paper describes the hardware and software infrastructure that supports machine learning at global scale. Facebook's machine learning workloads are extremely diverse: services require many different types of models in practice. This diversity has implications at all layers in the system stack. In addition, a sizable fraction of all data stored at Facebook flows through machine learning pipelines, presenting significant challenges in delivering data to high-performance distributed training flows. Computational requirements are also intense, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference. Addressing these and other emerging challenges continues to require diverse efforts that span machine learning algorithms, software, and hardware design.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128799598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Wei-gong Zhang, Jing Wang, Tao Li
Recent years have seen an explosion of data volumes from a myriad of IoT devices, such as various sensors and ubiquitous cameras. The deluge of IoT data creates enormous opportunities for us to explore the physical world, especially with the help of deep learning techniques. Traditionally, the Cloud is the option for deploying deep learning based applications. However, the challenges of Cloud-centric IoT systems are increasing due to significant data movement overhead, escalating energy needs, and privacy issues. Rather than constantly moving a tremendous amount of raw data to the Cloud, it would be beneficial to leverage the emerging powerful IoT devices to perform the inference task. Nevertheless, a statically trained model cannot efficiently handle the dynamic data in real in-situ environments, which leads to low accuracy. Moreover, the big raw IoT data challenges the traditional supervised training method in the Cloud. To tackle the above challenges, we propose In-situ AI, the first Autonomous and Incremental computing framework and architecture for deep learning based IoT applications. We equip deep learning based IoT systems with autonomous IoT data diagnosis (to minimize data movement) and an incremental, unsupervised training method (to tackle the big raw IoT data generated in ever-changing in-situ environments). To provide efficient architectural support for this new computing paradigm, we first characterize the two In-situ AI tasks (i.e., the inference and diagnosis tasks) on two popular IoT devices (i.e., a mobile GPU and an FPGA) and explore the design space and tradeoffs. Based on the characterization results, we propose two working modes for the In-situ AI tasks, namely Single-running and Co-running modes. Moreover, we craft analytical models for these two modes to guide the best configuration selection. We also develop a novel two-level weight-shared In-situ AI architecture to efficiently deploy In-situ AI tasks to IoT nodes. Compared with traditional IoT systems, our In-situ AI can reduce data movement by 28-71%, which further yields 1.4X-3.3X speedup on model update and contributes to 30-70% energy saving.
{"title":"In-Situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems","authors":"Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Wei-gong Zhang, Jing Wang, Tao Li","doi":"10.1109/HPCA.2018.00018","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00018","url":null,"abstract":"Recent years have seen an exploration of data volumes from a myriad of IoT devices, such as various sensors and ubiquitous cameras. The deluge of IoT data creates enormous opportunities for us to explore the physical world, especially with the help of deep learning techniques. Traditionally, the Cloud is the option for deploying deep learning based applications. However, the challenges of Cloud-centric IoT systems are increasing due to significant data movement overhead, escalating energy needs, and privacy issues. Rather than constantly moving a tremendous amount of raw data to the Cloud, it would be beneficial to leverage the emerging powerful IoT devices to perform the inference task. Nevertheless, the statically trained model could not efficiently handle the dynamic data in the real in-situ environments, which leads to low accuracy. Moreover, the big raw IoT data challenges the traditional supervised training method in the Cloud. To tackle the above challenges, we propose In-situ AI, the first Autonomous and Incremental computing framework and architecture for deep learning based IoT applications. We equip deep learning based IoT system with autonomous IoT data diagnosis (minimize data movement), and incremental and unsupervised training method (tackle the big raw IoT data generated in ever-changing in-situ environments). To provide efficient architectural support for this new computing paradigm, we first characterize the two In-situ AI tasks (i.e. inference and diagnosis tasks) on two popular IoT devices (i.e. mobile GPU and FPGA) and explore the design space and tradeoffs. Based on the characterization results, we propose two working modes for the In-situ AI tasks, including Single-running and Co-running modes. Moreover, we craft analytical models for these two modes to guide the best configuration selection. We also develop a novel two-level weight shared In-situ AI architecture to efficiently deploy In-situ tasks to IoT node. Compared with traditional IoT systems, our In-situ AI can reduce data movement by 28-71%, which further yields 1.4X-3.3X speedup on model update and contributes to 30-70% energy saving.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"262 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122073943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud computing has evolved into a promising computing paradigm. However, it remains a challenging task to protect application privacy and, in particular, the memory access patterns, on cloud servers. The Path ORAM protocol achieves high-level privacy protection but requires large memory bandwidth, which introduces severe execution interference. The recently proposed secure memory model greatly reduces the security enhancement overhead but demands the secure integration of cryptographic logic and memory devices, a memory architecture that is yet to prevail in mainstream cloud servers. In this paper, we propose D-ORAM, a novel Path ORAM scheme for achieving high-level privacy protection and low execution interference on cloud servers with untrusted memory. D-ORAM leverages the buffer-on-board (BOB) memory architecture to offload the Path ORAM primitives to a secure engine in the BOB unit, which greatly alleviates the contention for the off-chip memory bus between secure and non-secure applications. D-ORAM upgrades only one secure memory channel and employs Path ORAM tree split to extend the secure application flexibly across multiple channels, in particular, the non-secure channels. D-ORAM optimizes the link utilization to further improve the system performance. Our evaluation shows that D-ORAM effectively protects application privacy on mainstream computing servers with untrusted memory, with an improvement of NS-App performance by 22.5% on average over the Path ORAM baseline.
{"title":"D-ORAM: Path-ORAM Delegation for Low Execution Interference on Cloud Servers with Untrusted Memory","authors":"Rujia Wang, Youtao Zhang, Jun Yang","doi":"10.1109/HPCA.2018.00043","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00043","url":null,"abstract":"Cloud computing has evolved into a promising computing paradigm. However, it remains a challenging task to protect application privacy and, in particular, the memory access patterns, on cloud servers. The Path ORAM protocol achieves high-level privacy protection but requires large memory bandwidth, which introduces severe execution interference. The recently proposed secure memory model greatly reduces the security enhancement overhead but demands the secure integration of cryptographic logic and memory devices, a memory architecture that is yet to prevail in mainstream cloud servers.,,,, In this paper, we propose D-ORAM, a novel Path ORAM scheme for achieving high-level privacy protection and low execution interference on cloud servers with untrusted memory. D-ORAM leverages the buffer-on-board (BOB) memory architecture to offload the Path ORAM primitives to a secure engine in the BOB unit, which greatly alleviates the contention for the off-chip memory bus between secure and non-secure applications. D-ORAM upgrades only one secure memory channel and employs Path ORAM tree split to extend the secure application flexibly across multiple channels, in particular, the non-secure channels. D-ORAM optimizes the link utilization to further improve the system performance. Our evaluation shows that D-ORAM effectively protects application privacy on mainstream computing servers with untrusted memory, with an improvement of NS-App performance by 22.5% on average over the Path ORAM baseline.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132147842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, deep learning based approaches have emerged as indispensable tools to perform big data analytics. Normally, deep learning models are first trained with a supervised method and then deployed to execute various tasks. The supervised method involves extensive human effort to collect and label the large-scale dataset, which becomes impractical in the big data era where raw data is largely unlabeled and uncategorized. Fortunately, adversarial learning, represented by the Generative Adversarial Network (GAN), has enjoyed great success in unsupervised learning. However, the distinct features of GANs, such as their massive computing phases and non-traditional convolutions, challenge existing deep learning accelerator designs. In this work, we propose the first holistic solution for accelerating unsupervised GAN-based deep learning. We overcome the above challenges with an algorithm and architecture co-design approach. First, we optimize the training procedure to reduce on-chip memory consumption. We then propose a novel time-multiplexed design to efficiently map the abundant computing phases to our microarchitecture. Moreover, we design high-efficiency dataflows to achieve high data reuse and to skip the zero-operand multiplications in the non-traditional convolutions. Compared with traditional deep learning accelerators, our proposed design achieves the best performance (4.3X on average) with the same computing resources. Our design also achieves an average of 8.3X speedup over a CPU and 6.2X better energy efficiency than an NVIDIA GPU.
{"title":"Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-Based Deep Learning","authors":"Mingcong Song, Jiaqi Zhang, Huixiang Chen, Tao Li","doi":"10.1109/HPCA.2018.00016","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00016","url":null,"abstract":"Recently, deep learning based approaches have emerged as indispensable tools to perform big data analytics. Normally, deep learning models are first trained with a supervised method and then deployed to execute various tasks. The supervised method involves extensive human efforts to collect and label the large-scale dataset, which becomes impractical in the big data era where raw data is largely un-labeled and uncategorized. Fortunately, the adversarial learning, represented by Generative Adversarial Network (GAN), enjoys a great success on the unsupervised learning. However, the distinct features of GAN, such as massive computing phases and non-traditional convolutions challenge the existing deep learning accelerator designs. In this work, we propose the first holistic solution for accelerating the unsupervised GAN-based Deep Learning. We overcome the above challenges with an algorithm and architecture co-design approach. First, we optimize the training procedure to reduce on-chip memory consumption. We then propose a novel time-multiplexed design to efficiently map the abundant computing phases to our microarchitecture. Moreover, we design high-efficiency dataflows to achieve high data reuse and skip the zero-operand multiplications in the non-traditional convolutions. Compared with traditional deep learning accelerators, our proposed design achieves the best performance (average 4.3X) with the same computing resource. Our design also has an average of 8.3X speedup over CPU and 6.2X energy-efficiency over NVIDIA GPU.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116320354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haonan Wang, Fan Luo, M. Ibrahim, Onur Kayiran, Adwait Jog
Managing the thread-level parallelism (TLP) of GPGPU applications by limiting it to a certain degree is known to be effective in improving overall performance. However, we find that such prior techniques can lead to sub-optimal system throughput and fairness when two or more applications are co-scheduled on the same GPU. This is because they attempt to maximize the performance of individual applications in isolation, ultimately allowing each application to take a disproportionate amount of shared resources. This leads to high contention in the shared cache and memory. To address this problem, we propose new application-aware TLP management techniques for a multi-application execution environment such that all co-scheduled applications can make good and judicious use of all the shared resources. For measuring such use, we propose an application-level utility metric, called effective bandwidth, which accounts for two runtime metrics: attained DRAM bandwidth and cache miss rates. We find that maximizing the total effective bandwidth, and doing so in a balanced fashion across all co-located applications, can significantly improve system throughput and fairness. Instead of exhaustively searching across all the different combinations of TLP configurations that achieve these goals, we find that a significant amount of overhead can be reduced by taking advantage of the trends, which we call patterns, in the way an application's effective bandwidth changes with different TLP combinations. Our proposed pattern-based TLP management mechanisms improve system throughput and fairness by 20% and 2x, respectively, over a baseline where each application executes with the TLP configuration that provides the best performance when it executes alone.
{"title":"Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management","authors":"Haonan Wang, Fan Luo, M. Ibrahim, Onur Kayiran, Adwait Jog","doi":"10.1109/HPCA.2018.00030","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00030","url":null,"abstract":"Managing the thread-level parallelism (TLP) of GPGPU applications by limiting it to a certain degree is known to be effective in improving the overall performance. However, we find that such prior techniques can lead to sub-optimal system throughput and fairness when two or more applications are co-scheduled on the same GPU. It is because they attempt to maximize the performance of individual applications in isolation, ultimately allowing each application to take a disproportionate amount of shared resources. This leads to high contention in shared cache and memory. To address this problem, we propose new application-aware TLP management techniques for a multi-application execution environment such that all co-scheduled applications can make good and judicious use of all the shared resources. For measuring such use, we propose an application-level utility metric, called effective bandwidth, which accounts for two runtime metrics: attained DRAM bandwidth and cache miss rates. We find that maximizing the total effective bandwidth and doing so in a balanced fashion across all co-located applications can significantly improve the system throughput and fairness. Instead of exhaustively searching across all the different combinations of TLP configurations that achieve these goals, we find that a significant amount of overhead can be reduced by taking advantage of the trends, which we call patterns, in the way application's effective bandwidth changes with different TLP combinations. Our proposed pattern-based TLP management mechanisms improve the system throughput and fairness by 20% and 2x, respectively, over a baseline where each application executes with a TLP configuration that provides the best performance when it executes alone.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114547752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anthony Gutierrez, Bradford M. Beckmann, A. Duțu, Joseph Gross, Michael LeBeane, J. Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matthew D. Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain, Timothy G. Rogers
Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated with the instructions, and in some situations, the machine ISA’s intellectual property may not be publicly disclosed. In this paper, we demonstrate the pitfalls of evaluating GPUs using this higher-level abstraction, and make the case that several important microarchitecture interactions are only visible when executing lower-level instructions. Our analysis shows that given identical application source code and GPU microarchitecture models, execution behavior will differ significantly depending on the instruction set abstraction. For example, our analysis shows the dynamic instruction count of the machine ISA is nearly 2× that of the IL on average, but contention for vector registers is reduced by 3× due to the optimized resource utilization. In addition, our analysis highlights the deficiencies of using IL to model instruction fetching, control divergence, and value similarity. Finally, we show that simulating IL instructions adds 33% error as compared to the machine ISA when comparing absolute runtimes to real hardware.
{"title":"Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level","authors":"Anthony Gutierrez, Bradford M. Beckmann, A. Duțu, Joseph Gross, Michael LeBeane, J. Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matthew D. Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain, Timothy G. Rogers","doi":"10.1109/HPCA.2018.00058","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00058","url":null,"abstract":"Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated with the instructions, and in some situations, the machine ISA’s intellectual property may not be publicly disclosed. In this paper, we demonstrate the pitfalls of evaluating GPUs using this higher-level abstraction, and make the case that several important microarchitecture interactions are only visible when executing lower-level instructions. Our analysis shows that given identical application source code and GPU microarchitecture models, execution behavior will differ significantly depending on the instruction set abstraction. For example, our analysis shows the dynamic instruction count of the machine ISA is nearly 2× that of the IL on average, but contention for vector registers is reduced by 3× due to the optimized resource utilization. In addition, our analysis highlights the deficiencies of using IL to model instruction fetching, control divergence, and value similarity. Finally, we show that simulating IL instructions adds 33% error as compared to the machine ISA when comparing absolute runtimes to real hardware.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114842706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohammad Bakhshalipour, P. Lotfi-Kamran, H. Sarbazi-Azad
Big-data server applications frequently encounter data misses and hence lose significant performance potential. One way to reduce the number of data misses or their effect is data prefetching. As data accesses have high temporal correlations, temporal prefetching techniques are promising for them. While state-of-the-art temporal prefetching techniques are effective at reducing the number of data misses, we observe that there is a significant gap between what they offer and the opportunity. This work aims to improve the effectiveness of temporal prefetching techniques. We identify the lookup mechanism of existing temporal prefetchers as responsible for the large gap between what they offer and the opportunity. Existing lookup mechanisms either do not choose the right stream in the history or unnecessarily delay stream selection, and hence miss the opportunity at the beginning of every stream. In this work, we introduce Domino prefetching to address the limitations of existing temporal prefetchers. Domino is a temporal data prefetching technique that logically looks up the history with both the last miss address and the last two miss addresses to find a match for prefetching. We propose a practical design for the Domino prefetcher that employs an Enhanced Index Table indexed by just a single miss address. We show that the Domino prefetcher captures more than 90% of the temporal opportunity. Through detailed evaluation targeting a quad-core processor and a set of server workloads, we show that the Domino prefetcher improves system performance by 16% over a baseline with no data prefetcher and by 6% over the state-of-the-art temporal data prefetcher.
{"title":"Domino Temporal Data Prefetcher","authors":"Mohammad Bakhshalipour, P. Lotfi-Kamran, H. Sarbazi-Azad","doi":"10.1109/HPCA.2018.00021","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00021","url":null,"abstract":"Big-data server applications frequently encounter data misses, and hence, lose significant performance potential. One way to reduce the number of data misses or their effect is data prefetching. As data accesses have high temporal correlations, temporal prefetching techniques are promising for them. While state-of-the-art temporal prefetching techniques are effective at reducing the number of data misses, we observe that there is a significant gap between what they offer and the opportunity. This work aims to improve the effectiveness of temporal prefetching techniques. We identify the lookup mechanism of existing temporal prefetchers responsible for the large gap between what they offer and the opportunity. Existing lookup mechanisms either not choose the right stream in the history, or unnecessarily delay the stream selection, and hence, miss the opportunity at the beginning of every stream. In this work, we introduce Domino prefetching to address the limitations of existing temporal prefetchers. Domino prefetcher is a temporal data prefetching technique that logically looks up the history with both one and two last miss addresses to find a match for prefetching. We propose a practical design for Domino prefetcher that employs an Enhanced Index Table that is indexed by just a single miss address. We show that Domino prefetcher captures more than 90% of the temporal opportunity. Through detailed evaluation targeting a quad-core processor and a set of server workloads, we show that Domino prefetcher improves system performance by 16% over the baseline with no data prefetcher and 6% over the state-of- the-art temporal data prefetcher.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117237981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gururaj Saileshwar, Prashant J. Nair, Prakash Ramrakhyani, Wendy Elsasser, Moinuddin K. Qureshi
Building trusted data-centers requires resilient memories which are protected from both adversarial attacks and errors. Unfortunately, state-of-the-art memory security solutions incur considerable performance overheads due to accesses for security metadata such as Message Authentication Codes (MACs). At the same time, commercial secure memory solutions tend to be designed oblivious to the presence of memory reliability mechanisms (such as ECC-DIMMs) that provide tolerance to memory errors. Fortunately, ECC-DIMMs possess an additional chip for providing error correction codes (ECC) that is accessed in parallel with data, and this chip can be harnessed for security optimizations. If we can re-purpose the ECC chip to store metadata useful for both security and reliability, it can prove beneficial to both. To this end, this paper proposes Synergy, a reliability-security co-design that improves the performance of secure execution while providing strong reliability for systems with 9-chip ECC-DIMMs. Synergy uses the insight that MACs, being capable of detecting data tampering, are also useful for detecting memory errors. Therefore, MACs are best placed inside the ECC chip, where they can be accessed in parallel with each data access. By co-locating MAC and data, Synergy avoids a separate memory access for the MAC and thereby reduces the overall memory traffic of secure memory systems. Furthermore, Synergy tolerates the failure of 1 chip out of 9 by using a parity constructed over the 9 chips (8 data and 1 MAC), which is used to reconstruct the data of the failed chip. For memory-intensive workloads, Synergy provides a speedup of 20% and reduces the system Energy Delay Product by 31% compared to a secure memory baseline with ECC-DIMMs. At the same time, Synergy increases reliability by 185x compared to ECC-DIMMs that provide Single-Error Correction, Double-Error Detection (SECDED) capability. Synergy uses commercial ECC-DIMMs and does not incur any additional hardware overhead or reduction in security.
{"title":"SYNERGY: Rethinking Secure-Memory Design for Error-Correcting Memories","authors":"Gururaj Saileshwar, Prashant J. Nair, Prakash Ramrakhyani, Wendy Elsasser, Moinuddin K. Qureshi","doi":"10.1109/HPCA.2018.00046","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00046","url":null,"abstract":"Building trusted data-centers requires resilient memories which are protected from both adversarial attacks and errors. Unfortunately, the state-of-the-art memory security solutions incur considerable performance overheads due to accesses for security metadata like Message Authentication Codes (MACs). At the same time, commercial secure memory solutions tend to be designed oblivious to the presence of memory reliability mechanisms (such as ECC-DIMMs), that provide tolerance to memory errors. Fortunately, ECC-DIMMs possess an additional chip for providing error correction codes (ECC), that is accessed in parallel with data, which can be harnessed for security optimizations. If we can re-purpose the ECC-chip to store some metadata useful for security and reliability, it can prove beneficial to both. To this end, this paper proposes Synergy, a reliability-security co-design that improves performance of secure execution while providing strong reliability for systems with 9-chip ECC-DIMMs. Synergy uses the insight that MACs being capable of detecting data tampering are also useful for detecting memory errors. Therefore, MACs are best suited for being placed inside the ECC chip, to be accessed in parallel with each data access. By co-locating MAC and Data, Synergy is able to avoid a separate memory access for MAC and thereby reduce the overall memory traffic for secure memory systems. Furthermore, Synergy is able to tolerate 1 chip failure out of 9 chips by using a parity that is constructed over 9 chips (8 Data and 1 MAC), which is used for reconstructing the data of the failed chip. For memory intensive workloads, Synergy provides a speedup of 20% and reduces system Energy Delay Product by 31% compared to a secure memory baseline with ECC-DIMMs. At the same time, Synergy increases reliability by 185x compared to ECC-DIMMs that provide Single-Error Correction, Double-Error Detection (SECDED) capability. Synergy uses commercial ECC-DIMMs and does not incur any additional hardware overheads or reduction of security.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128911348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael McKeown, Alexey Lavrov, Mohammad Shahrad, Paul J. Jackson, Yaosheng Fu, Jonathan Balkind, Tri M. Nguyen, Katie Lim, Yanqi Zhou, D. Wentzlaff
The end of Dennard scaling and the looming power wall have made power and energy primary design goals for modern processors. Further, new applications such as cloud computing and the Internet of Things (IoT) continue to necessitate increased performance and energy efficiency. Manycore processors show potential in addressing some of these issues. However, there is little detailed power and energy data on manycore processors. In this work, we carefully study the detailed power and energy characteristics of Piton, a 25-core modern open source academic processor, including voltage versus frequency scaling, energy per instruction (EPI), memory system energy, network-on-chip (NoC) energy, thermal characteristics, and application performance and power consumption. This is the first detailed power and energy characterization of an open source manycore design implemented in silicon. The open source nature of the processor provides increased value, enabling detailed characterization verified against simulation and the ability to correlate results with the design and register transfer level (RTL) model. Additionally, this enables other researchers to utilize this work to build new power models, devise new research directions, and perform accurate power and energy research using the open source processor. The characterization data reveals a number of interesting insights, including that operand values have a large impact on EPI, that recomputing data can be more energy efficient than loading it from memory, and that on-chip data transmission (NoC) energy is low, along with insights on energy-efficient multithreaded core design. All data collected and the hardware infrastructure used are open source and available for download at http://www.openpiton.org.
{"title":"Power and Energy Characterization of an Open Source 25-Core Manycore Processor","authors":"Michael McKeown, Alexey Lavrov, Mohammad Shahrad, Paul J. Jackson, Yaosheng Fu, Jonathan Balkind, Tri M. Nguyen, Katie Lim, Yanqi Zhou, D. Wentzlaff","doi":"10.1109/HPCA.2018.00070","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00070","url":null,"abstract":"The end of Dennard’s scaling and the looming power wall have made power and energy primary design goals for modern processors. Further, new applications such as cloud computing and Internet of Things (IoT) continue to necessitate increased performance and energy efficiency. Manycore processors show potential in addressing some of these issues. However, there is little detailed power and energy data on manycore processors. In this work, we carefully study detailed power and energy characteristics of Piton, a 25-core modern open source academic processor, including voltage versus frequency scaling, energy per instruction (EPI), memory system energy, network-on-chip (NoC) energy, thermal characteristics, and application performance and power consumption. This is the first detailed power and energy characterization of an open source manycore design implemented in silicon. The open source nature of the processor provides increased value, enabling detailed characterization verified against simulation and the ability to correlate results with the design and register transfer level (RTL) model. Additionally, this enables other researchers to utilize this work to build new power models, devise new research directions, and perform accurate power and energy research using the open source processor. The characterization data reveals a number of interesting insights, including that operand values have a large impact on EPI, recomputing data can be more energy efficient than loading it from memory, on-chip data transmission (NoC) energy is low, and insights on energy efficient multithreaded core design. All data collected and the hardware infrastructure used is open source and available for download at http://www.openpiton.org.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131761996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}