Cache Automaton: Repurposing Caches for Automata Processing

Arun K. Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, D. Blaauw, D. Sylvester, R. Das
{"title":"缓存自动机:为自动机处理重新利用缓存","authors":"Arun K. Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, D. Blaauw, D. Sylvester, R. Das","doi":"10.1109/PACT.2017.51","DOIUrl":null,"url":null,"abstract":"Finite State Automata (FSA) are powerful computational models for extracting patterns from large streams (TBs/PBs) of unstructured data such as system logs, social media posts, emails, and news articles. FSA are also widely used in network security [6], bioinformatics [4] to enable efficient pattern matching. Compute-centric architectures like CPUs and GPG-PUs perform poorly on automata processing due to ir-regular memory accesses and can process only few state transitions every cycle due to memory bandwidth limitations. On the other hand, memory-centric architectures such as the DRAM-based Micron Automata Processor (AP) [2] can process up to 48K state transitions in a single cycle due to massive bit-level parallelism and reduced data movement/instruction processing overheads. Micron Automata Processor: The Micron AP re-purposes DRAM columns to store FSM states and the row address to stream input symbols. It implements homogeneous non-deterministic finite state automata (NFA), where each state has incoming transitions only on one input symbol. Each state has a label, which is the one-hot encoding of the symbols it is required to match against. Each input symbol is processed in two phases: (1) state-match, where the states whose label matches the input symbol are determined and (2) state-transition, where each of the matched states activates their corresponding next states. We explore SRAM-based last-level caches (LLCs) as a substrate for automata processing that are faster and integrated on processor dies. Cache capacity: One immediate concern is whether caches can store large automata. Interestingly, we observe that AP sacrifices a huge fraction of die area to accommodate the routing matrix and other non-memory components required for automata processing and only has a packing density comparable to caches. Repurposing caches for automata processing: While the memory technology benefits of moving to SRAM are apparent, repurposing the 40-60% passive LLC die area for massively parallel automata computation comes with several challenges. Processing an input symbol every LLC access (∼20-30 cycles @ 4GHz), would lead to an operating frequency comparable to DRAM-based AP (∼200 MHz), negating the memory technology benefits. Increasing operating frequency further can be made possible only by architecting an (1) in-situ computation model which is cognizant of internal geometry of LLC slices, and (2) accelerating state-match (array read) and state-transition (switch+wire propagation delay) phases of symbol processing. Accelerating state-match: This is challenging because industrial LLC subarrays typically have 4-8 bitlines sharing a sense-amp. This means that only 1 out of 4-8 states stored can match every cycle leading to gross under-utilization and loss of parallelism. To solve this, we leverage sense-amp cycling techniques that exploit spatial locality of state-matches. Accelerating state-transition: Accelerating state-transition at low-area cost requires the design of a scalable interconnect that efficiently encodes and supports multiple state-transitions on the same cycle, often to the same destination state. We observe that a 8T SRAM memory array can be re-purposed to become a compact state-transition crossbar for automata supporting large fan-in for states. 
Scaling to large automata: Supporting all-to-all connectivity between states requires prohibitively large and slow switches. Large real-world automata are typically composed of several connected components, grouped into densely connected partitions with only few (8-16) interconnections between them. This motivated us to explore a hierarchical switch topology with local switches providing rich intra-partition connectivity and global switches providing sparse inter-partition connectivity. To this end, we also design a compiler that automates the mapping of states into SRAM arrays. We propose and evaluate two architectures and mapping policies, one optimized for performance and the other optimized for space, across a set of 20 diverse benchmarks from ANMLZoo [5] and Regex [1] suites. We also demonstrate acceleration of parsing activities in browser front-end as a case study where FSA computations are a bottleneck taking up to 40% of the loading time of web pages [3]. The performance optimized and space optimized designs provide a speedup of 15× and 9× over Micron's AP respectively.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Cache Automaton: Repurposing Caches for Automata Processing\",\"authors\":\"Arun K. Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, D. Blaauw, D. Sylvester, R. Das\",\"doi\":\"10.1109/PACT.2017.51\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Finite State Automata (FSA) are powerful computational models for extracting patterns from large streams (TBs/PBs) of unstructured data such as system logs, social media posts, emails, and news articles. FSA are also widely used in network security [6], bioinformatics [4] to enable efficient pattern matching. Compute-centric architectures like CPUs and GPG-PUs perform poorly on automata processing due to ir-regular memory accesses and can process only few state transitions every cycle due to memory bandwidth limitations. On the other hand, memory-centric architectures such as the DRAM-based Micron Automata Processor (AP) [2] can process up to 48K state transitions in a single cycle due to massive bit-level parallelism and reduced data movement/instruction processing overheads. Micron Automata Processor: The Micron AP re-purposes DRAM columns to store FSM states and the row address to stream input symbols. It implements homogeneous non-deterministic finite state automata (NFA), where each state has incoming transitions only on one input symbol. Each state has a label, which is the one-hot encoding of the symbols it is required to match against. Each input symbol is processed in two phases: (1) state-match, where the states whose label matches the input symbol are determined and (2) state-transition, where each of the matched states activates their corresponding next states. We explore SRAM-based last-level caches (LLCs) as a substrate for automata processing that are faster and integrated on processor dies. Cache capacity: One immediate concern is whether caches can store large automata. Interestingly, we observe that AP sacrifices a huge fraction of die area to accommodate the routing matrix and other non-memory components required for automata processing and only has a packing density comparable to caches. 
Repurposing caches for automata processing: While the memory technology benefits of moving to SRAM are apparent, repurposing the 40-60% passive LLC die area for massively parallel automata computation comes with several challenges. Processing an input symbol every LLC access (∼20-30 cycles @ 4GHz), would lead to an operating frequency comparable to DRAM-based AP (∼200 MHz), negating the memory technology benefits. Increasing operating frequency further can be made possible only by architecting an (1) in-situ computation model which is cognizant of internal geometry of LLC slices, and (2) accelerating state-match (array read) and state-transition (switch+wire propagation delay) phases of symbol processing. Accelerating state-match: This is challenging because industrial LLC subarrays typically have 4-8 bitlines sharing a sense-amp. This means that only 1 out of 4-8 states stored can match every cycle leading to gross under-utilization and loss of parallelism. To solve this, we leverage sense-amp cycling techniques that exploit spatial locality of state-matches. Accelerating state-transition: Accelerating state-transition at low-area cost requires the design of a scalable interconnect that efficiently encodes and supports multiple state-transitions on the same cycle, often to the same destination state. We observe that a 8T SRAM memory array can be re-purposed to become a compact state-transition crossbar for automata supporting large fan-in for states. Scaling to large automata: Supporting all-to-all connectivity between states requires prohibitively large and slow switches. Large real-world automata are typically composed of several connected components, grouped into densely connected partitions with only few (8-16) interconnections between them. This motivated us to explore a hierarchical switch topology with local switches providing rich intra-partition connectivity and global switches providing sparse inter-partition connectivity. To this end, we also design a compiler that automates the mapping of states into SRAM arrays. We propose and evaluate two architectures and mapping policies, one optimized for performance and the other optimized for space, across a set of 20 diverse benchmarks from ANMLZoo [5] and Regex [1] suites. We also demonstrate acceleration of parsing activities in browser front-end as a case study where FSA computations are a bottleneck taking up to 40% of the loading time of web pages [3]. 
The performance optimized and space optimized designs provide a speedup of 15× and 9× over Micron's AP respectively.\",\"PeriodicalId\":438103,\"journal\":{\"name\":\"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PACT.2017.51\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2017.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Finite State Automata (FSA) are powerful computational models for extracting patterns from large streams (TBs/PBs) of unstructured data such as system logs, social media posts, emails, and news articles. FSA are also widely used in network security [6] and bioinformatics [4] to enable efficient pattern matching. Compute-centric architectures like CPUs and GPGPUs perform poorly on automata processing due to irregular memory accesses, and can process only a few state transitions per cycle due to memory bandwidth limitations. On the other hand, memory-centric architectures such as the DRAM-based Micron Automata Processor (AP) [2] can process up to 48K state transitions in a single cycle, owing to massive bit-level parallelism and reduced data movement/instruction processing overheads.

Micron Automata Processor: The Micron AP repurposes DRAM columns to store FSM states and the row address to stream input symbols. It implements homogeneous non-deterministic finite state automata (NFA), where each state has incoming transitions on only one input symbol. Each state has a label, which is the one-hot encoding of the symbols it must match against. Each input symbol is processed in two phases: (1) state-match, where the states whose label matches the input symbol are determined, and (2) state-transition, where each matched state activates its corresponding next states.

We explore SRAM-based last-level caches (LLCs), which are faster than DRAM and integrated on the processor die, as a substrate for automata processing. Cache capacity: One immediate concern is whether caches can store large automata. Interestingly, we observe that the AP sacrifices a huge fraction of its die area to accommodate the routing matrix and other non-memory components required for automata processing, and achieves only a packing density comparable to caches.

Repurposing caches for automata processing: While the memory technology benefits of moving to SRAM are apparent, repurposing the 40-60% of passive LLC die area for massively parallel automata computation comes with several challenges. Processing one input symbol per LLC access (∼20-30 cycles at 4 GHz, i.e., roughly 130-200 million symbols per second) would yield an operating frequency comparable to the DRAM-based AP (∼200 MHz), negating the memory technology benefits. Increasing the operating frequency further is possible only by (1) architecting an in-situ computation model that is cognizant of the internal geometry of LLC slices, and (2) accelerating the state-match (array read) and state-transition (switch + wire propagation delay) phases of symbol processing.

Accelerating state-match: This is challenging because industrial LLC subarrays typically have 4-8 bitlines sharing a sense-amp. This means that only 1 out of every 4-8 stored states can match each cycle, leading to gross under-utilization and loss of parallelism. To solve this, we leverage sense-amp cycling techniques that exploit the spatial locality of state-matches.

Accelerating state-transition: Accelerating state-transition at low area cost requires the design of a scalable interconnect that efficiently encodes and supports multiple state-transitions in the same cycle, often to the same destination state. We observe that an 8T SRAM array can be repurposed as a compact state-transition crossbar for automata, supporting large fan-in for states.
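As a concrete illustration of this two-phase model, the following minimal Python sketch (ours, not from the paper) simulates a homogeneous NFA with bit-vectors: state-match selects the enabled states whose label contains the input symbol, and state-transition ORs together the next-state rows of all matched states, which is the role the 8T SRAM crossbar plays in hardware. The function name, the bitmask representation, and the toy two-state automaton are illustrative assumptions.

```python
# Minimal software analogue of homogeneous-NFA symbol processing (illustrative
# only; the paper implements this in SRAM arrays, not software). Bit i of each
# integer mask corresponds to state i.

def run_nfa(labels, transitions, start_mask, accept_mask, input_bytes):
    """labels[i]: set of symbols state i matches against (its label).
    transitions[i]: bitmask of states activated when state i matches.
    start_mask: states re-enabled every cycle (start-anywhere states).
    Returns input positions where an accepting state matched."""
    active, reports = start_mask, []
    for pos, sym in enumerate(input_bytes):
        # Phase 1: state-match -- enabled states whose label contains sym.
        matched = 0
        for i, label in enumerate(labels):
            if (active >> i) & 1 and sym in label:
                matched |= 1 << i
        if matched & accept_mask:
            reports.append(pos)
        # Phase 2: state-transition -- OR the next-state rows of all matched
        # states (the crossbar's job), plus the always-enabled start states.
        active = start_mask
        for i, row in enumerate(transitions):
            if (matched >> i) & 1:
                active |= row
    return reports

# Hypothetical 2-state automaton for the pattern "ab" anywhere in the input:
# state 0 matches 'a' and activates state 1; state 1 matches 'b' and reports.
labels      = [{ord('a')}, {ord('b')}]
transitions = [0b10, 0b00]
print(run_nfa(labels, transitions, 0b01, 0b10, b"xabyab"))  # -> [2, 5]
```

In hardware, phase 1 is a parallel array read over all states at once and phase 2 is a wired-OR through the crossbar, which is what makes thousands of transitions per cycle possible.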
Scaling to large automata: Supporting all-to-all connectivity between states requires prohibitively large and slow switches. Fortunately, large real-world automata are typically composed of several connected components, which can be grouped into densely connected partitions with only a few (8-16) interconnections between them. This motivated us to explore a hierarchical switch topology, with local switches providing rich intra-partition connectivity and global switches providing sparse inter-partition connectivity. To this end, we also design a compiler that automates the mapping of states into SRAM arrays (a toy sketch of this partitioning idea follows the abstract).

We propose and evaluate two architectures and mapping policies, one optimized for performance and the other for space, across a set of 20 diverse benchmarks from the ANMLZoo [5] and Regex [1] suites. As a case study, we also demonstrate acceleration of parsing activities in the browser front-end, where FSA computations are a bottleneck taking up to 40% of web page loading time [3]. The performance-optimized and space-optimized designs provide speedups of 15× and 9× over Micron's AP, respectively.
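To make the hierarchical mapping above concrete, here is a hypothetical partitioning heuristic (a minimal sketch, not the paper's compiler algorithm): it greedily grows fixed-capacity partitions along transition edges so that densely connected states share a local switch, then counts the transitions that must cross the sparse global switch. The function name, the greedy BFS strategy, and the 256-state capacity are assumptions for illustration.

```python
from collections import defaultdict, deque

def partition_states(num_states, edges, partition_size=256):
    """Greedy BFS partitioner (illustrative): grow each partition along
    transition edges until full, so that most transitions stay local."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    assignment, unassigned, part = {}, set(range(num_states)), 0
    while unassigned:
        # Seed a new partition and flood-fill until it reaches capacity.
        frontier, filled = deque([next(iter(unassigned))]), 0
        while frontier and filled < partition_size:
            s = frontier.popleft()
            if s not in unassigned:
                continue
            unassigned.discard(s)
            assignment[s] = part
            filled += 1
            frontier.extend(n for n in adj[s] if n in unassigned)
        part += 1
    # Transitions within a partition use its local switch; the rest need
    # ports on the sparse global switch (few, for well-structured automata).
    local = [(a, b) for a, b in edges if assignment[a] == assignment[b]]
    cut   = [(a, b) for a, b in edges if assignment[a] != assignment[b]]
    return assignment, local, cut
```

A production compiler would instead solve a cut-minimizing placement problem; this sketch only shows why a two-level local/global switch hierarchy suffices when only a handful of edges cross partition boundaries.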