Optimizing many-field packet classification on FPGA, multi-core general purpose processor, and GPU

Yun Qu, Hao Zhang, Shijie Zhou, V. Prasanna
2015 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS)
Published: 2015-05-07 · DOI: 10.1109/ANCS.2015.7110123
Citations: 38

Abstract

Due to the rapid growth of the Internet, there is an increasing need for efficiently classifying packets with many header fields against large rule sets. For example, in Software Defined Networking (SDN), an OpenFlow table lookup can require 15 packet header fields to be examined. In this paper, we present several decomposition-based packet classification implementations with efficient optimization techniques. In the searching phase, packet header fields are split or combined and searched independently. In the merging phase, the partial searching results from all the fields are merged to generate the final result. We prototype our implementations on a state-of-the-art Field Programmable Gate Array (FPGA), a multi-core General Purpose Processor (GPP), and a Graphics Processing Unit (GPU). On FPGA, we propose two optimization techniques to divide generic ranges; modular processing elements are constructed and concatenated into a systolic array. On multi-core GPP, we parallelize both the searching and merging phases using parallel program threads. On the GPU-accelerated platform, we minimize branch divergence and reduce the data communication overhead. Experimental results show that 500 Million Packets Per Second (MPPS) throughput and 3 μs latency can be achieved for 1.5K rule sets on FPGA. We achieve 14.7 MPPS and 30.5 MPPS throughput for 32K rule sets on the multi-core GPP and GPU-accelerated platforms, respectively. As a heterogeneous solution, our GPU-accelerated packet classifier shows a 2x speedup compared to the implementation using the multi-core GPP only. Compared with prior works, our designs can match long packet headers against very complex rule sets.
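The search-then-merge decomposition described in the abstract can be illustrated with a toy sketch. The rule set, field layout, and priority scheme below are hypothetical (the paper targets 15-field OpenFlow rules; this sketch uses only two fields), but the structure is the standard bit-vector approach: each field lookup yields a bit vector of candidate rules, and the merging phase intersects the per-field vectors.

```python
# Hypothetical sketch of decomposition-based classification with bit vectors.
# Search phase: each header field is looked up independently, producing a
# bit vector with bit i set iff rule i's constraint on that field matches.
# Merging phase: the per-field vectors are AND-ed; the lowest-index
# (highest-priority) surviving rule is the final result.

RULES = [
    # (src_port_range, protocol_set) -- toy 2-field rules, index = priority
    ((0, 1023), {6}),        # TCP to a well-known port
    ((0, 65535), {17}),      # any UDP
    ((1024, 65535), {6, 17}),  # TCP/UDP to an ephemeral port
]

def field_search(value, matches):
    """Search one field: build the partial-result bit vector for `value`."""
    bv = 0
    for i, rule in enumerate(RULES):
        if matches(rule, value):
            bv |= 1 << i
    return bv

def classify(src_port, proto):
    # Search phase: one partial result (bit vector) per field.
    bv_port = field_search(src_port, lambda r, v: r[0][0] <= v <= r[0][1])
    bv_proto = field_search(proto, lambda r, v: v in r[1])
    # Merging phase: intersect partial results, report highest priority.
    merged = bv_port & bv_proto
    for i in range(len(RULES)):
        if merged & (1 << i):
            return i
    return None  # no rule matched

print(classify(80, 6))   # TCP to port 80 -> rule 0
```

On the parallel platforms in the paper, the per-field searches run concurrently (threads on GPP, warps on GPU) and the AND-merge is a cheap reduction, which is what makes decomposition attractive for many-field rules.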
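The abstract mentions two FPGA techniques for dividing generic ranges but does not name them; a classic building block in this area is range-to-prefix conversion, which splits an arbitrary port range into prefix-aligned subranges so it can be matched with prefix hardware. The function below is a generic sketch of that well-known conversion, not the paper's specific method.

```python
def range_to_prefixes(lo, hi, width=16):
    """Split [lo, hi] into a minimal list of prefix-aligned subranges.

    Returns (base, prefix_len) pairs, where prefix_len bits of `base`
    are fixed and the remaining (width - prefix_len) bits are wildcards.
    """
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that starts aligned at `lo`...
        size = (lo & -lo) if lo else (1 << width)
        # ...shrunk until it fits entirely within the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        prefix_len = width - size.bit_length() + 1
        prefixes.append((lo, prefix_len))
        lo += size
    return prefixes

# Classic example: [1, 14] over 4 bits needs 6 prefixes
# (0001, 001*, 01**, 10**, 110*, 1110).
print(range_to_prefixes(1, 14, 4))
```

In the worst case a w-bit range expands into 2w - 2 prefixes, which is why schemes that match ranges natively (as systolic-array range comparators can) avoid this blow-up.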