S. Seetharama, M. Cohen, S. Sengupta, D. Panda, L. Paraschis
This session comprises the following tutorials: Hands-on Tutorial on Software-Defined Networking; Interconnection Networks for Cloud Data Centers; Designing Scientific, Enterprise, and Cloud Computing Systems with InfiniBand and High-Speed Ethernet: Current Status and Trends; and The Evolution of Network Architecture towards Cloud-Centric Applications.
{"title":"Tutorials - HOTI 2012","authors":"S. Seetharama, M. Cohen, S. Sengupta, D. Panda, L. Paraschis","doi":"10.1109/HOTI.2012.25","DOIUrl":"https://doi.org/10.1109/HOTI.2012.25","url":null,"abstract":"This keynotes discusses the following: Hands-on Tutorial on Software-Defined Networking; Interconnection Networks for Cloud Data Centers; Designing Scientific, Enterprise, and Cloud Computing Systems with InfiniBand and High-Speed Ethernet: Current Status and Trends; The Evolution of Network Architecture towards CloudCentric Applications.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126318563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Weighted Differential Scheduler (WDS) is a new scheduling discipline for accessing shared resources. The work described here was motivated by the need for a simple weighted scheduler for a network switch where multiple packet flows compete for an output port. The scheme can be implemented with simple arithmetic logic and finite state machines. We describe several versions of WDS that can merge two or more flows. An analysis reveals that WDS has lower jitter than any other weighted scheduler known to us.
{"title":"Weighted Differential Scheduler","authors":"H. Eberle, W. Olesinski","doi":"10.1109/HOTI.2012.12","DOIUrl":"https://doi.org/10.1109/HOTI.2012.12","url":null,"abstract":"The Weighted Differential Scheduler (WDS) is a new scheduling discipline for accessing shared resources. The work described here was motivated by the need for a simple weighted scheduler for a network switch where multiple packet flows are competing for an output port. The scheme can be implemented with simple arithmetic logic and finite state machines. We are describing several versions of WDS that can merge two or more flows. An analysis reveals that WDS has lower jitter than any other weighted scheduler known to us.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130798638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Gutierrez, N. Hjelm, Manjunath Gorentla Venkata, R. Graham
Open MPI is a widely used open-source implementation of the MPI-2 standard that supports a variety of platforms and interconnects. Current versions of Open MPI, however, lack support for the Cray XE6 and XK6 architectures -- both of which use the Gemini System Interconnect. In this paper, we present extensions to natively support these architectures within Open MPI, describe and propose solutions for performance and scalability bottlenecks, and provide an extensive evaluation of our implementation, which is the first completely open-source MPI implementation for the Cray XE/XK system families to run at 49,152 processes. Application and micro-benchmark results show that the performance and scaling characteristics of our implementation are similar to those of the vendor-supplied MPI. Micro-benchmark results show 1-byte and 1,024-byte message latencies of 1.20 μs and 4.13 μs, which are 10.00% and 39.71% better than the vendor-supplied MPI's, respectively. Our implementation achieves a bandwidth of 5.32 GB/s at 8 MB, similar to the vendor-supplied MPI's bandwidth at the same message size. Two Sequoia benchmark applications, LAMMPS and AMG2006, were also chosen to evaluate our implementation at scales of up to 49,152 cores, where it exhibited performance and scaling characteristics similar to the vendor-supplied MPI implementation. LAMMPS achieved a parallel efficiency of 88.20% at 49,152 cores using Open MPI, on par with that of the vendor-supplied MPI.
{"title":"Performance Evaluation of Open MPI on Cray XE/XK Systems","authors":"S. Gutierrez, N. Hjelm, Manjunath Gorentla Venkata, R. Graham","doi":"10.1109/HOTI.2012.11","DOIUrl":"https://doi.org/10.1109/HOTI.2012.11","url":null,"abstract":"Open MPI is a widely used open-source implementation of the MPI-2 standard that supports a variety of platforms and interconnects. Current versions of Open MPI, however, lack support for the Cray XE6 and XK6 architectures -- both of which use the Gemini System Interconnect. In this paper, we present extensions to natively support these architectures within Open MPI, describe and propose solutions for performance and scalability bottlenecks, and provide an extensive evaluation of our implementation, which is the first completely open-source MPI implementation for the Cray XE/XK system families used at 49,152 processes. Application and micro-benchmark results show that the performance and scaling characteristics of our implementation are similar to the vendor-supplied MPI's. Micro-benchmark results show short-data 1-byte and 1,024-byte message latencies of 1.20 μs and 4.13 μs, which are 10.00% and 39.71% better than the vendor-supplied MPI's, respectively. Our implementation achieves a bandwidth of 5.32 GB/s at 8 MB, which is similar to the vendor-supplied MPI's bandwidth at the same message size. Two Sequoia benchmark applications, LAMMPS and AMG2006, were also chosen to evaluate our implementation at scales up to 49,152 cores -- where we exhibited similar performance and scaling characteristics when compared to the vendor-supplied MPI implementation. LAMMPS achieved a parallel efficiency of 88.20% at 49,152 cores using Open MPI, which is on par with the vendor-supplied MPI's achieved parallel efficiency.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124460078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Monia Ghobadi, Geoffrey Salmon, Y. Ganjali, Martin Labrecque, J. Steffan
This paper presents Caliper, a highly accurate packet injection tool that generates precise and responsive traffic. Caliper takes live packets generated on a host computer and transmits them onto a gigabit Ethernet network with precise inter-transmission times. Existing software traffic generators rely on generic network interface cards which, as we demonstrate, do not provide high-precision timing guarantees; this makes valid and convincing time-sensitive network experiments difficult or impossible. Our evaluations show that Caliper is able to reproduce packet inter-transmission times from a given arbitrary distribution while capturing the closed-loop feedback of TCP sources. Specifically, we demonstrate that Caliper provides three orders of magnitude better precision than a commodity NIC: with requested traffic rates up to the line rate, Caliper incurs an error of 8 ns or less in packet transmission times. Furthermore, we explore Caliper's ability to integrate with existing network simulators to project simulated traffic characteristics into a real network environment. Caliper is freely available online.
{"title":"Caliper: Precise and Responsive Traffic Generator","authors":"Monia Ghobadi, Geoffrey Salmon, Y. Ganjali, Martin Labrecque, J. Steffan","doi":"10.1109/HOTI.2012.16","DOIUrl":"https://doi.org/10.1109/HOTI.2012.16","url":null,"abstract":"This paper presents Caliper, a highly-accurate packet injection tool that generates precise and responsive traffic. Caliper takes live packets generated on a host computer and transmits them onto a gigabit Ethernet network with precise inter-transmission times. Existing software traffic generators rely on generic Network Interface Cards which, as we demonstrate, do not provide high-precision timing guarantees. Hence, performing valid and convincing experiments becomes difficult or impossible in the context of time-sensitive network experiments. Our evaluations show that Caliper is able to reproduce packet inter-transmission times from a given arbitrary distribution while capturing the closed-loop feedback of TCP sources. Specifically, we demonstrate that Caliper provides three orders of magnitude better precision compared to commodity NIC: with requested traffic rates up to the line rate, Caliper incurs an error of 8 ns or less in packet transmission times. Furthermore, we explore Caliper's ability to integrate with existing network simulators to project simulated traffic characteristics into a real network environment. Caliper is freely available online.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"51 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128871285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The miniaturization of CMOS technology has reached a scale at which server processors are starting to integrate multi-gigabit network interface controllers (NICs). While transistors are becoming cheap and abundant in solid-state circuits, they remain at a premium on a processor die if they do not contribute to increasing the number of cores and caches. Therefore, an integrated NIC (iNIC) must provide high networking performance at high logic density and low power dissipation. This paper describes the design of an integrated accelerator that offloads computation-intensive protocol-processing tasks. The accelerator combines the concepts of the transport-triggered architecture with a programmable finite-state machine to deliver high instruction-level parallelism, efficient multiway branching, and flexibility. The flexibility is key to adapting to protocol changes and addressing new applications. This accelerator was used in the construction of a 10 GbE iNIC in 45-nm CMOS technology. The accelerator's ratio of performance (15 Mfps and 20 Gb/s throughput per port) to area (0.7 mm²) and its power consumption (0.15 W) were core enablers for constructing a processor compute complex with four iNICs.
{"title":"Rx Stack Accelerator for 10 GbE Integrated NIC","authors":"F. Abel, C. Hagleitner, Fabrice Verplanken","doi":"10.1109/HOTI.2012.18","DOIUrl":"https://doi.org/10.1109/HOTI.2012.18","url":null,"abstract":"The miniaturization of CMOS technology has reached a scale at which server processors are starting to integrate multi-gigabit network interface controllers (NIC). While transistors are becoming cheap and abundant in solid-state circuits, they remain at a premium on a processor die if they do not contribute to increase the number of cores and caches. Therefore, an integrated NIC (iNIC) must provide high networking performance under high logic density and low power dissipation. This paper describes the design of an integrated accelerator to offload computation-intensive protocol-processing tasks. The accelerator combines the concepts of the transport-triggered architecture with a programmable finite-state machine to deliver high instruction-level parallelism, efficient multiway branching and flexibility. The flexibility is key to adapt to protocol changes and address new applications. This accelerator was used in the construction of a 10 GbE iNIC in 45-nm CMOS technology. The ratio of performance (15 Mfps - 20 Gb/s Tput per port) to area (0.7 mm2) and the power consumption (0.15 W) of this accelerator were core enablers for constructing a processor compute complex with four iNICs.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128009425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jérôme Vienne, Jitong Chen, Md. Wasi-ur-Rahman, Nusrat S. Islam, H. Subramoni, D. Panda
Communication interfaces of high performance computing (HPC) systems and clouds have been continually evolving to meet the ever-increasing communication demands placed on them by HPC applications and cloud computing middleware (e.g., Hadoop). PCIe interfaces can now deliver speeds up to 128 Gbps (Gen3), and high performance interconnects (10/40 GigE, 32 Gbps InfiniBand QDR, 54 Gbps InfiniBand FDR, and 10/40 GigE RDMA over Converged Ethernet) are capable of delivering speeds from 10 to 54 Gbps. However, no previous study has demonstrated how much benefit an end user in the HPC / cloud computing domain can expect from newer generations of these interconnects over older ones, or how one type of interconnect (such as IB) performs in comparison to another (such as RoCE). In this paper we evaluate various high performance interconnects over the new PCIe Gen3 interface with both HPC and cloud computing workloads. Our comprehensive analysis, performed at multiple levels, provides a global view of the impact these modern interconnects have on the performance of HPC applications and cloud computing middleware. The results of our experiments show that the latest InfiniBand FDR interconnect gives the best performance for both HPC and cloud computing applications.
{"title":"Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems","authors":"Jérôme Vienne, Jitong Chen, Md. Wasi-ur-Rahman, Nusrat S. Islam, H. Subramoni, D. Panda","doi":"10.1109/HOTI.2012.19","DOIUrl":"https://doi.org/10.1109/HOTI.2012.19","url":null,"abstract":"Communication interfaces of high performance computing (HPC) systems and clouds have been continually evolving to meet the ever increasing communication demands being placed on them by HPC applications and cloud computing middleware (e.g., Hadoop). The PCIe interfaces can now deliver speeds up to 128 Gbps (Gen3) and high performance interconnects (10/40 GigE, InfiniBand 32 Gbps QDR, InfiniBand 54 Gbps FDR, 10/40 GigE RDMA over Converged Ethernet) are capable of delivering speeds from 10 to 54 Gbps. However, no previous study has demonstrated how much benefit an end user in the HPC / cloud computing domain can expect by utilizing newer generations of these interconnects over older ones or how one type of interconnect (such as IB) performs in comparison to another (such as RoCE).In this paper we evaluate various high performance interconnects over the new PCIe Gen3 interface with HPC as well as cloud computing workloads. Our comprehensive analysis done at different levels, provides a global scope of the impact these modern interconnects have on the performance of HPC applications and cloud computing middleware. The results of our experiments show that the latest InfiniBand FDR interconnect gives the best performance for HPC as well as cloud computing applications.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129790033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Neeser, Nikolaos Chrysos, R. Clauberg, D. Crisan, M. Gusat, C. Minkenberg, Kenneth M. Valk, C. Basso
One consequential feature of Converged Enhanced Ethernet (CEE) is losslessness, achieved through L2 Priority Flow Control (PFC) and Quantized Congestion Notification (QCN). We focus on QCN and its effectiveness in identifying congestive flows in input-buffered CEE switches. QCN assumes an idealized, output-queued switch. However, as future switches scale to higher port counts and link speeds, purely output-queued or shared-memory architectures lead to excessive memory bandwidth requirements; moreover, PFC typically requires dedicated buffers per input. Our objective is to complement PFC's coarse per-port/priority granularity with QCN's per-flow control. By detecting buffer overload early, QCN can drastically reduce PFC's side effects. We install QCN congestion points (CPs) at input buffers with virtual output queues and demonstrate that arrival-based marking cannot correctly discriminate between culprits and victims. Our main contribution is occupancy sampling (QCN-OS), a novel, QCN-compatible marking scheme. We focus on random occupancy sampling, a practical method that requires no per-flow state. For CPs with arbitrarily scheduled buffers, QCN-OS is shown to correctly identify congestive flows, improving buffer utilization, switch efficiency, and fairness.
{"title":"Occupancy Sampling for Terabit CEE Switches","authors":"F. Neeser, Nikolaos Chrysos, R. Clauberg, D. Crisan, M. Gusat, C. Minkenberg, Kenneth M. Valk, C. Basso","doi":"10.1109/HOTI.2012.14","DOIUrl":"https://doi.org/10.1109/HOTI.2012.14","url":null,"abstract":"One consequential feature of Converged Enhanced Ethernet (CEE) is loss lessness, achieved through L2 Priority Flow Control (PFC) and Quantized Congestion Notification (QCN). We focus on QCN and its effectiveness in identifying congestive flows in input-buffered CEE switches. QCN assumes an idealized, output-queued switch, however, as future switches scale to higher port counts and link speeds, purely output-queued or shared-memory architectures lead to excessive memory bandwidth requirements, moreover, PFC typically requires dedicated buffers per input. Our objective is to complement PFC's coarse per-port/priority granularity with QCN's per-flow control. By detecting buffer overload early, QCN can drastically reduce PFC's side effects. We install QCN congestion points (CPs) at input buffers with virtual output queues and demonstrate that arrival-based marking cannot correctly discriminate between culprits and victims. Our main contribution is occupancy sampling (QCN-OS), a novel, QCN-compatible marking scheme. We focus on random occupancy sampling, a practical method not requiring any per-flow state. For CPs with arbitrarily scheduled buffers, QCN-OSis shown to correctly identify congestive flows, improving buffer utilization, switch efficiency, and fairness.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114370577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeffrey Fong, Xiang Wang, Yaxuan Qi, Jun Li, Weirong Jiang
Packet classification is a fundamental enabling function for various applications in switches, routers and firewalls. Due to their performance and scalability limitations, current packet classification solutions are insufficient to address the challenges of growing network bandwidth and the increasing number of new applications. This paper presents a scalable parallel architecture, named ParaSplit, for high-performance packet classification. We propose a rule set partitioning algorithm based on range-point conversion to reduce the overall memory requirement. We further optimize the partitioning by applying the simulated annealing technique. We implement the architecture on a Field Programmable Gate Array (FPGA) to achieve high throughput by exploiting the abundant parallelism in the hardware. Evaluation using real-life data sets, including OpenFlow-like 11-tuple rules, shows that ParaSplit achieves significant reductions in memory requirements compared with state-of-the-art algorithms such as HyperSplit [6] and EffiCuts [8]. Because of ParaSplit's memory efficiency, our FPGA design can support multiple engines in on-chip memory, each holding up to 10K complex rules. As a result, an architecture with multiple parallel ParaSplit engines can achieve up to Terabit-per-second throughput for large and complex rule sets on a single FPGA device.
{"title":"ParaSplit: A Scalable Architecture on FPGA for Terabit Packet Classification","authors":"Jeffrey Fong, Xiang Wang, Yaxuan Qi, Jun Li, Weirong Jiang","doi":"10.1109/HOTI.2012.17","DOIUrl":"https://doi.org/10.1109/HOTI.2012.17","url":null,"abstract":"Packet classification is a fundamental enabling function for various applications in switches, routers and firewalls. Due to their performance and scalability limitations, current packet classification solutions are insufficient in ad-dressing the challenges from the growing network bandwidth and the increasing number of new applications. This paper presents a scalable parallel architecture, named Para Split, for high-performance packet classification. We propose a rule set partitioning algorithm based on range-point conversion to reduce the overall memory requirement. We further optimize the partitioning by applying the Simulated Annealing technique. We implement the architecture on a Field Programmable Gate Array (FPGA) to achieve high throughput by exploiting the abundant parallelism in the hardware. Evaluation using real-life data sets including Open Flow-like 11-tuple rules shows that Para Split achieves significant reduction in memory requirement, compared with the-state-of-the-art algorithms such as Hyper Split [6] and EffiCuts [8]. Because of the memory efficiency of Para Split, our FPGA design can support in the on-chip memory multiple engines, each of which contains up to 10K complex rules. As a result, the architecture with multiple Para Split engines in parallel can achieve up to Terabit per second throughput for large and complex rule sets on a single FPGA device.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130154550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Lockwood, Adwait Gupte, Nishit Mehta, Michaela Blott, T. English, K. Vissers
Current High-Frequency Trading (HFT) platforms are typically implemented in software on computers with high-performance network adapters. The high and unpredictable latency of these systems has led the trading world to explore alternative "hybrid" architectures with hardware acceleration. In this paper, we survey existing solutions and describe how FPGAs are being used in electronic trading to approach the goal of zero latency. We present an FPGA IP library which implements networking, I/O, memory interfaces and financial protocol parsers. The library provides pre-built infrastructure which accelerates the development and verification of new financial applications. We have developed an example financial application using the IP library on a custom 1U FPGA appliance. The application sustains 10 Gb/s Ethernet line rate with a fixed end-to-end latency of 1 μs - up to two orders of magnitude lower than comparable software implementations.
{"title":"A Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT)","authors":"J. Lockwood, Adwait Gupte, Nishit Mehta, Michaela Blott, T. English, K. Vissers","doi":"10.1109/HOTI.2012.15","DOIUrl":"https://doi.org/10.1109/HOTI.2012.15","url":null,"abstract":"Current High-Frequency Trading (HFT) platforms are typically implemented in software on computers with high-performance network adapters. The high and unpredictable latency of these systems has led the trading world to explore alternative \"hybrid\" architectures with hardware acceleration. In this paper, we survey existing solutions and describe how FPGAs are being used in electronic trading to approach the goal of zero latency. We present an FPGA IP library which implements networking, I/O, memory interfaces and financial protocol parsers. The library provides pre-built infrastructure which accelerates the development and verification of new financial applications. We have developed an example financial application using the IP library on a custom 1U FPGA appliance. The application sustains 10Gb/s Ethernet line rate with a fixed end-to-end latency of 1μs - up to two orders of magnitude lower than comparable software implementations.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134451197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}