Raghavendra Pradyumna Pothukuchi, Amin Ansari, Bhargava Gopireddy, J. Torrellas
Networks-on-Chip (NoCs) in chip multiprocessors are prone to within-die process variation as they span the whole chip. To tolerate variation, their voltages (Vdd) carry over-provisioned guardbands. As a result, prior work has proposed to save energy by operating at reduced Vdd while occasionally suffering and fixing errors. Unfortunately, these proposals use heuristic controller designs that provide no guarantees on error bounds. In this work, we develop a scheme that dynamically minimizes the Vdd of groups of routers in a variation-prone NoC using formal control-theoretic methods. The scheme, called Sthira, saves substantial energy while guaranteeing the stability and convergence of error rates. We also enhance the scheme with a low-cost secondary network that retransmits erroneous packets for higher energy efficiency. The enhanced scheme is called Sthira+. We evaluate Sthira and Sthira+ with simulations of NoCs with 64-100 routers. In an NoC with 8 routers per Vdd domain, our schemes reduce the average energy consumption of the NoC by 27%; in a futuristic NoC with one router per Vdd domain, Sthira+ and Sthira reduce the average energy consumption by 36% and 32%, respectively. The performance impact is negligible. These are significant savings over the state-of-the-art. We conclude that formal control is essential, and that the cheaper Sthira is more cost-effective than Sthira+.
{"title":"Sthira: A Formal Approach to Minimize Voltage Guardbands under Variation in Networks-on-Chip for Energy Efficiency","authors":"Raghavendra Pradyumna Pothukuchi, Amin Ansari, Bhargava Gopireddy, J. Torrellas","doi":"10.1109/PACT.2017.23","DOIUrl":"https://doi.org/10.1109/PACT.2017.23","url":null,"abstract":"Networks-on-Chip (NoCs) in chip multiprocessors are prone to within-die process variation as they span the whole chip. To tolerate variation, their voltages (Vdd) carry over-provisioned guardbands. As a result, prior work has proposed to save energy by operating at reduced Vdd while occasionally suffering and fixing errors. Unfortunately, these proposals use heuristic controller designs that provide no error bounds guarantees.In this work, we develop a scheme that dynamically minimizes the Vdd of groups of routers in a variation-prone NoC using formal control-theoretic methods. The scheme, called Sthira, saves substantial energy while guaranteeing the stability and convergence of error rates. We also enhance the scheme with a low-cost secondary network that retransmits erroneous packets for higher energy efficiency. The enhanced scheme is called Sthira+. We evaluate Sthira and Sthira+ with simulations of NoCs with 64-100 routers. In an NoC with 8 routers per Vdd domain, our schemes reduce the average energy consumptionof the NoC by 27%; in a futuristic NoC with one router per Vdd domain, Sthira+ and Sthira reduce the average energy consumption by 36% and 32%, respectively. The performance impact is negligible. These are significant savings over the state-of-the-art. We conclude that formal control is essential, and that the cheaper Sthira is more cost-effective than Sthira+.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132931050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amro Awad, Arkaprava Basu, S. Blagodurov, Yan Solihin, G. Loh
Updates to a process's page table entry (PTE) render any existing copies of that PTE in any of a system's TLBs stale. To prevent a process from making illegal memory accesses using stale TLB entries, the operating system (OS) performs a costly TLB shootdown operation. Rather than explicitly issuing shootdowns, we propose a coordinated TLB and page table management mechanism where an expiration time is associated with each TLB entry. An expired TLB entry is treated as invalid. For each PTE, the OS then tracks the latest expiration time of any TLB entry potentially caching that PTE. No shootdown is issued if the OS modifies a PTE when its corresponding latest expiration time has already passed. In this paper, we explain the hardware and OS support required to support Self-invalidating TLB entries (SITE). As an emerging use case that needs fast TLB shootdowns, we consider memory systems consisting of different types of memory (e.g., faster DRAM and slower non-volatile memory), where aggressive migrations are desirable to keep frequently accessed pages in faster memory, but pages cannot migrate too often because each migration requires a PTE update and corresponding TLB shootdown. We demonstrate that such heterogeneous memory systems augmented with SITE can achieve an average performance improvement of 45.5% over a similar system with traditional TLB shootdowns by avoiding more than 65% of the shootdowns.
{"title":"Avoiding TLB Shootdowns Through Self-Invalidating TLB Entries","authors":"Amro Awad, Arkaprava Basu, S. Blagodurov, Yan Solihin, G. Loh","doi":"10.1109/PACT.2017.38","DOIUrl":"https://doi.org/10.1109/PACT.2017.38","url":null,"abstract":"Updates to a process's page table entry (PTE) renders any existing copies of that PTE in any of a system's TLBs stale. To prevent a process from making illegal memory accesses using stale TLB entries, the operating system (OS) performs a costly TLB shootdown operation. Rather than explicitly issuing shootdowns, we propose a coordinated TLB and page table management mechanism where an expirationtime is associated with each TLB entry. An expired TLB entry is treated as invalid. For each PTE, the OS then tracks the latest expiration time of any TLB entry potentially caching that PTE. No shootdown is issued if the OS modifies a PTE when its corresponding latest expiration time has already passed.In this paper, we explain the hardware and OS support required to support Self-invalidating TLB entries (SITE). As an emerging use case that needs fast TLB shootdowns, we consider memory systems consisting of different types of memory (e.g., faster DRAM and slower non-volatile memory) where aggressive migrations are desirable to keep frequently accessed pages in faster memory, but pages cannot migratetoo often because each migration requires a PTE update and corresponding TLB shootdown. We demonstrate that such heterogeneous memory systems augmented with SITE can allow an average performance improvement of 45.5% over a similar system with traditional TLB shootdowns by avoiding more than 65% of the shootdowns.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124064345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In spite of the multicore revolution, high single-thread performance still plays an important role in ensuring a decent overall gain. Look-ahead is a proven strategy for uncovering implicit parallelism; however, a conventional out-of-order core quickly becomes resource-inefficient when looking beyond a short distance. An effective approach is to use an independent look-ahead thread running on a separate context, guided by a program slice known as the skeleton. We observe that fixed heuristics to generate skeletons are often suboptimal. As a consequence, the look-ahead agent cannot target sufficient bottlenecks to reap all the benefits it should. In this paper, we present DRUT, a holistic hardware-software solution that achieves good single-thread performance by tuning the look-ahead skeleton efficiently. First, we propose a number of dynamic transformations to branch-based code modules (which we call Do-It-Yourself, or DIY) that enable a faster look-ahead thread without compromising the quality of the look-ahead. Second, we extend our tuning mechanism to any arbitrary code region and use a profile-driven technique to tune the skeleton for the whole program. Assisted by these techniques, the look-ahead thread improves the performance of a baseline decoupled look-ahead by up to 1.93× with a geometric mean of 1.15×. Our techniques, combined with the weak dependence removal technique, improve the performance of a baseline look-ahead by up to 2.12× with a geometric mean of 1.20×. This is an impressive performance gain of 1.61× over the single-thread baseline, which is much better than conventional Turbo Boost with a comparable energy budget.
{"title":"DRUT: An Efficient Turbo Boost Solution via Load Balancing in Decoupled Look-Ahead Architecture","authors":"Raj Parihar, Michael C. Huang","doi":"10.1109/PACT.2017.35","DOIUrl":"https://doi.org/10.1109/PACT.2017.35","url":null,"abstract":"In spite of the multicore revolution, high single thread performance still plays an important role in ensuring a decentoverall gain. Look-ahead is a proven strategy in uncoveringimplicit parallelism; however, a conventional out-of-ordercore quickly becomes resource-inefficient when looking beyond a short distance. An effective approach is to use an in-dependent look-ahead thread running on a separate contextguided by a program slice known as the skeleton. We observethat fixed heuristics to generate skeletons are often suboptimal. As a consequence, look-ahead agent is not able to targetsufficient bottlenecks to reap all the benefits it should.In this paper, we present DRUT, a holistic hardware-software solution, which achieves good single thread performance by tuning the look-ahead skeleton efficiently. First, we propose a number of dynamic transformations to branchbased code modules (we call them Do-It-Yourself or DIY)that enable a faster look-ahead thread without compromisingthe quality of the look-ahead. Second, we extend our tuningmechanism to any arbitrary code region and use a profile-driven technique to tune the skeleton for the whole program.Assisted by the aforementioned techniques, look-aheadthread improves the performance of a baseline decoupledlook-ahead by up to 1.93× with a geometric mean of 1.15×. Our techniques, combined with the weak dependence removal technique, improve the performance of a baselinelook-ahead by up to 2.12× with a geometric mean of 1.20×. This is an impressive performance gain of 1.61× over thesingle-thread baseline, which is much better compared toconventional Turbo Boost with a comparable energy budget.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124696415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Taewook Oh, S. Beard, Nick P. Johnson, S. Popovych, David I. August
Computational scientists are typically not expert programmers, and thus work in easy-to-use dynamic languages. However, they have very high performance requirements, due to their large datasets and experimental setups. Thus, the performance required for computational science must be extracted from dynamic languages in a manner that is transparent to the programmer. Current approaches to optimize and parallelize dynamic languages, such as just-in-time compilation and highly optimized interpreters, require a huge amount of implementation effort and are typically only effective for a single language. However, scientists in different fields use different languages, depending upon their needs. This paper presents techniques to enable automatic extraction of parallelism within scripts that are universally applicable across multiple different dynamic scripting languages. The key insight is that combining a script with its interpreter, through program specialization techniques, will embed any parallelism within the script into the combined program, which can then be extracted via automatic parallelization techniques. Additionally, this paper presents several enhancements to existing speculative automatic parallelization techniques to handle the dependence patterns created by the specialization process. A prototype of the proposed technique, called Partial Evaluation with Parallelization (PEP), is evaluated against two open-source script interpreters with 6 input linear algebra kernel scripts each. The resulting geomean speedup of 5.10× on a 24-core machine shows the potential of the generalized approach in automatic extraction of parallelism in dynamic scripting languages.
{"title":"A Generalized Framework for Automatic Scripting Language Parallelization","authors":"Taewook Oh, S. Beard, Nick P. Johnson, S. Popovych, David I. August","doi":"10.1109/PACT.2017.28","DOIUrl":"https://doi.org/10.1109/PACT.2017.28","url":null,"abstract":"Computational scientists are typically not expert programmers, and thus work in easy to use dynamic languages. However, they have very high performance requirements, due to their large datasets and experimental setups. Thus, the performance required for computational science must be extracted from dynamic languages in a manner that is transparent to the programmer. Current approaches to optimize and parallelize dynamic languages, such as just-in-time compilation and highly optimized interpreters, require a huge amount of implementation effort and are typically only effective for a single language. However, scientists in different fields use different languages, depending upon their needs.This paper presents techniques to enable automatic extraction of parallelism within scripts that are universally applicable across multiple different dynamic scripting languages. The key insight is that combining a script with its interpreter, through program specialization techniques, will embed any parallelism within the script into the combined program that can then be extracted via automatic parallelization techniques. Additionally, this paper presents several enhancements to existing speculative automatic parallelization techniques to handle the dependence patterns created by the specialization process. A prototype of the proposed technique, called Partial Evaluation with Parallelization (PEP), is evaluated against two open-source script interpreters with 6 input linear algebra kernel scripts each. The resulting geomean speedup of 5.10× on a 24-core machine shows the potential of the generalized approach in automatic extraction of parallelism in dynamic scripting languages.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116332147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arun K. Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, D. Blaauw, D. Sylvester, R. Das
Finite State Automata (FSA) are powerful computational models for extracting patterns from large streams (TBs/PBs) of unstructured data such as system logs, social media posts, emails, and news articles. FSA are also widely used in network security [6] and bioinformatics [4] to enable efficient pattern matching. Compute-centric architectures like CPUs and GPGPUs perform poorly on automata processing due to irregular memory accesses and can process only a few state transitions every cycle due to memory bandwidth limitations. On the other hand, memory-centric architectures such as the DRAM-based Micron Automata Processor (AP) [2] can process up to 48K state transitions in a single cycle due to massive bit-level parallelism and reduced data movement/instruction processing overheads.

Micron Automata Processor: The Micron AP repurposes DRAM columns to store FSM states and the row address to stream input symbols. It implements homogeneous non-deterministic finite state automata (NFA), where each state has incoming transitions only on one input symbol. Each state has a label, which is the one-hot encoding of the symbols it is required to match against. Each input symbol is processed in two phases: (1) state-match, where the states whose label matches the input symbol are determined, and (2) state-transition, where each of the matched states activates its corresponding next states. We explore SRAM-based last-level caches (LLCs) as a substrate for automata processing; they are faster and are integrated on processor dies.

Cache capacity: One immediate concern is whether caches can store large automata. Interestingly, we observe that the AP sacrifices a huge fraction of die area to accommodate the routing matrix and other non-memory components required for automata processing, and only has a packing density comparable to caches.

Repurposing caches for automata processing: While the memory technology benefits of moving to SRAM are apparent, repurposing the 40-60% passive LLC die area for massively parallel automata computation comes with several challenges. Processing an input symbol every LLC access (∼20-30 cycles @ 4GHz) would lead to an operating frequency comparable to the DRAM-based AP (∼200 MHz), negating the memory technology benefits. Increasing the operating frequency further is possible only by architecting (1) an in-situ computation model that is cognizant of the internal geometry of LLC slices, and (2) accelerated state-match (array read) and state-transition (switch + wire propagation delay) phases of symbol processing.

Accelerating state-match: This is challenging because industrial LLC subarrays typically have 4-8 bitlines sharing a sense-amp. This means that only 1 out of 4-8 stored states can match every cycle, leading to gross under-utilization and loss of parallelism. To solve this, we leverage sense-amp cycling techniques that exploit spatial locality of state-matches.

Accelerating state-transition: Accelerating state-transition at low-area cost requir
{"title":"Cache Automaton: Repurposing Caches for Automata Processing","authors":"Arun K. Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, D. Blaauw, D. Sylvester, R. Das","doi":"10.1109/PACT.2017.51","DOIUrl":"https://doi.org/10.1109/PACT.2017.51","url":null,"abstract":"Finite State Automata (FSA) are powerful computational models for extracting patterns from large streams (TBs/PBs) of unstructured data such as system logs, social media posts, emails, and news articles. FSA are also widely used in network security [6], bioinformatics [4] to enable efficient pattern matching. Compute-centric architectures like CPUs and GPG-PUs perform poorly on automata processing due to ir-regular memory accesses and can process only few state transitions every cycle due to memory bandwidth limitations. On the other hand, memory-centric architectures such as the DRAM-based Micron Automata Processor (AP) [2] can process up to 48K state transitions in a single cycle due to massive bit-level parallelism and reduced data movement/instruction processing overheads. Micron Automata Processor: The Micron AP re-purposes DRAM columns to store FSM states and the row address to stream input symbols. It implements homogeneous non-deterministic finite state automata (NFA), where each state has incoming transitions only on one input symbol. Each state has a label, which is the one-hot encoding of the symbols it is required to match against. Each input symbol is processed in two phases: (1) state-match, where the states whose label matches the input symbol are determined and (2) state-transition, where each of the matched states activates their corresponding next states. We explore SRAM-based last-level caches (LLCs) as a substrate for automata processing that are faster and integrated on processor dies. Cache capacity: One immediate concern is whether caches can store large automata. Interestingly, we observe that AP sacrifices a huge fraction of die area to accommodate the routing matrix and other non-memory components required for automata processing and only has a packing density comparable to caches. Repurposing caches for automata processing: While the memory technology benefits of moving to SRAM are apparent, repurposing the 40-60% passive LLC die area for massively parallel automata computation comes with several challenges. Processing an input symbol every LLC access (∼20-30 cycles @ 4GHz), would lead to an operating frequency comparable to DRAM-based AP (∼200 MHz), negating the memory technology benefits. Increasing operating frequency further can be made possible only by architecting an (1) in-situ computation model which is cognizant of internal geometry of LLC slices, and (2) accelerating state-match (array read) and state-transition (switch+wire propagation delay) phases of symbol processing. Accelerating state-match: This is challenging because industrial LLC subarrays typically have 4-8 bitlines sharing a sense-amp. This means that only 1 out of 4-8 states stored can match every cycle leading to gross under-utilization and loss of parallelism. To solve this, we leverage sense-amp cycling techniques that exploit spatial locality of state-matches. 
Accelerating state-transition: Accelerating state-transition at low-area cost requir","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127500744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
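The two-phase symbol processing described above (state-match, then state-transition) maps naturally onto bit-vector operations. The sketch below simulates a homogeneous NFA that way in ordinary C++; it mirrors the computational model only, not the SRAM array organization or sense-amp techniques, and the tiny automaton and its always-enabled start state are assumptions made for the example.

```cpp
// Software sketch of the homogeneous-NFA model used by automata processors:
// phase 1 (state-match) selects the states whose label matches the input
// symbol, phase 2 (state-transition) ORs together their next-state sets.
#include <array>
#include <bitset>
#include <iostream>
#include <string>

constexpr int kStates = 4;  // tiny NFA recognizing occurrences of "ab"

int main() {
    // match_vector[c]: bit s is set if state s is labeled with symbol c.
    std::array<std::bitset<kStates>, 256> match_vector{};
    // next_states[s]: states activated when state s matches.
    std::array<std::bitset<kStates>, kStates> next_states{};
    std::bitset<kStates> start, accept, active;

    // State 0: start state (enabled on every symbol), matches 'a', enables state 1.
    // State 1: matches 'b', accepting.
    match_vector['a'].set(0);
    match_vector['b'].set(1);
    next_states[0].set(1);
    start.set(0);
    accept.set(1);

    std::string input = "xabab";
    for (char c : input) {
        // Phase 1: state-match -- which enabled states are labeled with c?
        std::bitset<kStates> matched =
            (active | start) & match_vector[(unsigned char)c];

        // Phase 2: state-transition -- union of the matched states' next sets.
        std::bitset<kStates> next;
        for (int s = 0; s < kStates; ++s)
            if (matched.test(s)) next |= next_states[s];

        if ((matched & accept).any())
            std::cout << "match ending at symbol '" << c << "'\n";
        active = next;
    }
}
```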
The memory model for RISC-V, a newly developed open-source ISA, has not been finalized yet and thus offers an opportunity to evaluate existing memory models. We believe RISC-V should not adopt the memory models of POWER or ARM, because their axiomatic and operational definitions are too complicated. We propose two new weak memory models: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. Both allow all instruction reorderings except the overtaking of loads by a store. We show that this restriction has little impact on performance and that it considerably simplifies operational definitions. It also rules out the out-of-thin-air problem that plagues many definitions. WMM is simple (it is similar to the Alpha memory model), but it disallows behaviors arising due to shared store buffers and shared write-through caches (which are seen in POWER processors). WMM-S, on the other hand, is more complex and allows these behaviors. We give the operational definitions of both models using Instantaneous Instruction Execution (I2E), which has been used in the definitions of SC and TSO. We also show how both models can be implemented using conventional cache-coherent memory systems and out-of-order processors, and how they encompass the behaviors of most known optimizations.
{"title":"Weak Memory Models: Balancing Definitional Simplicity and Implementation Flexibility","authors":"Sizhuo Zhang, M. Vijayaraghavan, Arvind","doi":"10.1109/PACT.2017.29","DOIUrl":"https://doi.org/10.1109/PACT.2017.29","url":null,"abstract":"The memory model for RISC-V, a newly developed open source ISA, has not been finalized yet and thus, offers an opportunity to evaluate existing memory models. We believe RISC-V should not adopt the memory models of POWER or ARM, because their axiomatic and operational definitions are too complicated. We propose two new weak memory models: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. Both allow all instruction reorderings except overtaking of loads by a store. We show that this restriction has little impact on performance and it considerably simplifies operational definitions. It also rules out the out-of-thin-air problem that plagues many definitions. WMM is simple (it is similar to the Alpha memory model), but it disallows behaviors arising due to shared store buffers and shared write-through caches (which are seen in POWER processors). WMM-S, on the other hand, is more complex and allows these behaviors. We give the operational definitions of both models using Instantaneous Instruction Execution (I2E), which has been used in the definitions of SC and TSO. We also show how both models can be implemented using conventional cache-coherent memory systems and out-of-order processors, and encompasses the behaviors of most known optimizations.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128317347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Early design-space evaluation of computer systems is usually performed using performance models such as detailed simulators, RTL-based models, etc. Unfortunately, it is very challenging (often impossible) to run many emerging applications on detailed performance models owing to their complex application software stacks, significantly long run times, system dependencies, and the limited speed/potential of early performance models. To overcome these challenges in benchmarking complex, long-running database applications, we propose a fast and efficient proxy generation methodology, PerfProx, that can generate miniature proxy benchmarks which are representative of the performance of real-world database applications and yet converge to results quickly and do not need any complex software-stack support. Past research on proxy generation utilizes detailed microarchitecture-independent metrics derived from detailed functional simulators, which are often difficult to generate for many emerging applications. PerfProx enables fast and efficient proxy generation using performance metrics derived primarily from hardware performance counters. We evaluate the proposed proxy generation approach on three modern, real-world SQL and NoSQL databases (Cassandra, MongoDB, and MySQL) running both the data-serving and data-analytics classes of applications on different hardware platforms and cache/TLB configurations. The proxy benchmarks mimic the performance (IPC) of the original database applications with ∼94.2% (avg) accuracy. We further demonstrate that the proxies mimic original application performance across several other key metrics, while significantly reducing the instruction counts.
{"title":"Proxy Benchmarks for Emerging Big-Data Workloads","authors":"Reena Panda, L. John","doi":"10.1109/PACT.2017.44","DOIUrl":"https://doi.org/10.1109/PACT.2017.44","url":null,"abstract":"Early design-space evaluation of computer-systems is usually performed using performance models such as detailed simulators, RTL-based models etc. Unfortunately, it is very challenging (often impossible) to run many emerging applications on detailed performance models owing to their complex application software-stacks, significantly long run times, system dependencies and the limited speed/potential of early performance models. To overcome these challenges in benchmarking complex, long-running database applications, we propose a fast and efficient proxy generation methodology, PerfProx that can generate miniature proxy benchmarks, which are representative of the performance of real-world database applications and yet, converge to results quickly and do not need any complex software-stack support. Past research on proxy generation utilizes detailed micro-architecture independent metrics derived from detailed functional simulators, which are often difficult to generate for many emerging applications. PerfProx enables fast and efficient proxy generation using performance metrics derived primarily from hardware performance counters. We evaluate the proposed proxy generation approach on three modern, real-world SQL and NoSQL databases, Cassandra, MongoDB and MySQL running both the data-serving and data-analytics class of applications on different hardware platforms and cache/TLB configurations. The proxy benchmarks mimic the performance (IPC) of the original database applications with ∼94.2% (avg) accuracy. We further demonstrate that the proxies mimic original application performance across several other key metrics, while significantly reducing the instruction counts.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114463866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a design for a fine-grained lock-based skiplist optimized for Graphics Processing Units (GPUs). While GPUs are often used to accelerate streaming parallel computations, it remains a significant challenge to efficiently offload concurrent computations with more complicated data-irregular access and fine-grained synchronization. Natural building blocks for such computations would be concurrent data structures, such as skiplists, which are widely used in general-purpose computations. Our design utilizes array-based nodes which are accessed and updated by warp-cooperative functions, thus taking advantage of the fact that GPUs are most efficient when memory accesses are coalesced and execution divergence is minimized. The proposed design has been implemented, and measurements demonstrate improved performance of up to 11.6x over existing GPU skiplist designs.
{"title":"A GPU-Friendly Skiplist Algorithm","authors":"Nurit Moscovici, Nachshon Cohen, E. Petrank","doi":"10.1145/3018743.3019032","DOIUrl":"https://doi.org/10.1145/3018743.3019032","url":null,"abstract":"We propose a design for a fine-grained lock-based skiplist optimized for Graphics Processing Units (GPUs). While GPUs are often used to accelerate streaming parallel computations, it remains a significant challenge to efficiently offload concurrent computations with more complicated data-irregular access and fine-grained synchronization. Natural building blocks for such computations would be concurrent data structures, such as skiplists, which are widely used in general purpose computations. Our design utilizes array-based nodes which are accessed and updated by warp-cooperative functions, thus taking advantage of the fact that GPUs are most efficient when memory accesses are coalesced and execution divergence is minimized. The proposed design has been implemented, and measurements demonstrate improved performance of up to 11.6x over skiplist designs for the GPU existing today.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127817581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory and logic integration on the same chip is becoming increasingly cost-effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation, MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages respectively.
{"title":"Near-Memory Address Translation","authors":"Javier Picorel, Djordje Jevdjic, B. Falsafi","doi":"10.1109/PACT.2017.56","DOIUrl":"https://doi.org/10.1109/PACT.2017.56","url":null,"abstract":"Memory and logic integration on the same chip is becoming increasingly cost effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common.In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing for translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages respectively.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117166223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vincent T. Lee, Amrita Mazumdar, Carlo C. del Mundo, Armin Alaghi, L. Ceze, M. Oskin
Similarity search is key to important applications such as content-based search, deduplication, natural language processing, computer vision, databases, and graphics. At its core, similarity search manifests as k-nearest neighbors (kNN), which consists of parallel distance calculations and a top-k sort. While kNN is poorly supported by today's architectures, it is ideal for near-data processing because of its high memory bandwidth requirements. This work proposes a near-data processing accelerator for similarity search: the similarity search associative memory (SSAM).
{"title":"POSTER: Application-Driven Near-Data Processing for Similarity Search","authors":"Vincent T. Lee, Amrita Mazumdar, Carlo C. del Mundo, Armin Alaghi, L. Ceze, M. Oskin","doi":"10.1109/PACT.2017.25","DOIUrl":"https://doi.org/10.1109/PACT.2017.25","url":null,"abstract":"Similarity search is a key to important applications such as content-based search, deduplication, natural language processing, computer vision, databases, and graphics. At its core, similarity search manifests as k-nearest neighbors (kNN) which consists of parallel distance calculations and a top-k sort. While kNN is poorly supported by today's architectures, it is ideal for near-data processing because of its high memory bandwidth requirements. This work proposes a near-data processing accelerator for similarity search: the similarity search associative memory (SSAM).","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125983075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}