Optimizing many-field packet classification on FPGA, multi-core general purpose processor, and GPU

Yun Qu, Hao Zhang, Shijie Zhou, V. Prasanna
2015 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS)
Published: 2015-05-07 · DOI: 10.1109/ANCS.2015.7110123
Citations: 38

Abstract

Due to the rapid growth of the Internet, there is an increasing need for efficiently classifying packets with many header fields against large rule sets. For example, in Software Defined Networking (SDN), an OpenFlow table lookup can require 15 packet header fields to be examined. In this paper, we present several decomposition-based packet classification implementations with efficient optimization techniques. In the searching phase, packet header fields are split or combined and searched independently. In the merging phase, the partial searching results from all the fields are merged to generate the final result. We prototype our implementations on a state-of-the-art Field Programmable Gate Array (FPGA), a multi-core General Purpose Processor (GPP), and a Graphics Processing Unit (GPU). On FPGA, we propose two optimization techniques to divide generic ranges; modular processing elements are constructed and concatenated into a systolic array. On multi-core GPP, we parallelize both the searching and merging phases using parallel program threads. On the GPU-accelerated platform, we minimize branch divergence and reduce the data communication overhead. Experimental results show that 500 Million Packets Per Second (MPPS) throughput and 3 μs latency can be achieved for 1.5K rule sets on FPGA. We achieve 14.7 MPPS and 30.5 MPPS throughput for 32K rule sets on the multi-core GPP and GPU-accelerated platforms, respectively. As a heterogeneous solution, our GPU-accelerated packet classifier shows a 2x speedup compared to the implementation using the multi-core GPP only. Compared with prior works, our designs can match long packet headers against very complex rule sets.
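The search-then-merge decomposition described in the abstract can be illustrated with a toy sketch. The rule set, field layout, and priority scheme below are hypothetical (the paper targets 15-field OpenFlow rules; this sketch uses only two fields), but the structure is the standard bit-vector approach: each field lookup yields a bit vector of candidate rules, and the merging phase intersects the per-field vectors.

```python
# Hypothetical sketch of decomposition-based classification with bit vectors.
# Search phase: each header field is looked up independently, producing a
# bit vector with bit i set iff rule i's constraint on that field matches.
# Merging phase: the per-field vectors are AND-ed; the lowest-index
# (highest-priority) surviving rule is the final result.

RULES = [
    # (src_port_range, protocol_set) -- toy 2-field rules, index = priority
    ((0, 1023), {6}),        # TCP to a well-known port
    ((0, 65535), {17}),      # any UDP
    ((1024, 65535), {6, 17}),  # TCP/UDP to an ephemeral port
]

def field_search(value, matches):
    """Search one field: build the partial-result bit vector for `value`."""
    bv = 0
    for i, rule in enumerate(RULES):
        if matches(rule, value):
            bv |= 1 << i
    return bv

def classify(src_port, proto):
    # Search phase: one partial result (bit vector) per field.
    bv_port = field_search(src_port, lambda r, v: r[0][0] <= v <= r[0][1])
    bv_proto = field_search(proto, lambda r, v: v in r[1])
    # Merging phase: intersect partial results, report highest priority.
    merged = bv_port & bv_proto
    for i in range(len(RULES)):
        if merged & (1 << i):
            return i
    return None  # no rule matched

print(classify(80, 6))   # TCP to port 80 -> rule 0
```

On the parallel platforms in the paper, the per-field searches run concurrently (threads on GPP, warps on GPU) and the AND-merge is a cheap reduction, which is what makes decomposition attractive for many-field rules.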
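The abstract mentions two FPGA techniques for dividing generic ranges but does not name them; a classic building block in this area is range-to-prefix conversion, which splits an arbitrary port range into prefix-aligned subranges so it can be matched with prefix hardware. The function below is a generic sketch of that well-known conversion, not the paper's specific method.

```python
def range_to_prefixes(lo, hi, width=16):
    """Split [lo, hi] into a minimal list of prefix-aligned subranges.

    Returns (base, prefix_len) pairs, where prefix_len bits of `base`
    are fixed and the remaining (width - prefix_len) bits are wildcards.
    """
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that starts aligned at `lo`...
        size = (lo & -lo) if lo else (1 << width)
        # ...shrunk until it fits entirely within the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        prefix_len = width - size.bit_length() + 1
        prefixes.append((lo, prefix_len))
        lo += size
    return prefixes

# Classic example: [1, 14] over 4 bits needs 6 prefixes
# (0001, 001*, 01**, 10**, 110*, 1110).
print(range_to_prefixes(1, 14, 4))
```

In the worst case a w-bit range expands into 2w - 2 prefixes, which is why schemes that match ranges natively (as systolic-array range comparators can) avoid this blow-up.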