This paper presents a novel cell architecture for evolvable systolic arrays. HexCell is a tileable processing element with a hexagonal shape that can be implemented and dynamically reconfigured on field-programmable gate arrays (FPGAs). The cell contains a functional unit, three input ports, and three output ports. It supports two concurrent configuration schemes: dynamic partial reconfiguration (DPR), where the functional unit is partially reconfigured at run time, and a virtual reconfiguration circuit (VRC), where each cell output port either bypasses one of the input data streams or selects the functional-unit output. Hence, HexCell combines the merits of DPR and VRC, including resource awareness, reconfiguration speed, and routing flexibility. In addition, the cell structure supports pipelining and data synchronization for achieving high throughput in data-intensive applications such as image processing. A HexCell is represented by a binary string (chromosome) that encodes the cell's function and its output selections. Our evolvable HexCell array supports more inputs and outputs, a wider variety of possible datapaths, and faster reconfiguration than the state-of-the-art systolic array, while maintaining the same resource utilization. Moreover, when the same genetic algorithm is run on the two systolic arrays, results show that the HexCell array achieves higher throughput and evolves faster than the state-of-the-art array.
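To make the chromosome idea concrete, the following Python sketch decodes a packed HexCell chromosome into a function and three output-port selections. The field widths and function table are illustrative assumptions; the paper does not fix this exact encoding.

```python
# Minimal sketch of decoding a HexCell chromosome, assuming a 4-bit function
# field followed by one 2-bit selector per output port (field widths are
# illustrative; the paper does not specify the exact encoding).
FUNCTIONS = {0b0000: "pass", 0b0001: "add", 0b0010: "sub", 0b0011: "mul"}

def decode_hexcell(chromosome: int, num_outputs: int = 3) -> dict:
    """Split a packed integer chromosome into function and per-output selectors."""
    func_bits = chromosome & 0b1111
    config = {"function": FUNCTIONS.get(func_bits, "reserved"), "outputs": []}
    selectors = chromosome >> 4
    for port in range(num_outputs):
        sel = (selectors >> (2 * port)) & 0b11
        # sel 0..2: bypass input port 0..2, sel 3: drive functional-unit output
        config["outputs"].append("FU" if sel == 0b11 else f"in{sel}")
    return config

print(decode_hexcell(0b11_01_00_0001))  # add; out0<-in0, out1<-in1, out2<-FU
```

A genetic algorithm would mutate and recombine such bit strings, with the DPR function field selecting a partial bitstream and the VRC selector bits driving the output multiplexers.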
{"title":"HexCell: a Hexagonal Cell for Evolvable Systolic Arrays on FPGAs: (Abstract Only)","authors":"F. Hussein, Luka Daoud, N. Rafla","doi":"10.1145/3174243.3174988","DOIUrl":"https://doi.org/10.1145/3174243.3174988","url":null,"abstract":"This paper presents a novel cell architecture for evolvable systolic arrays. HexCell is a tile-able processing element with a hexagonal shape that can be implemented and dynamically reconfigured on field-programmable gate arrays (FPGAs). The cell contains a functional unit, three input ports, and three output ports. It supports two concurrent configuration schemes: dynamic partial reconfiguration (DPR), where the functional unit is partially reconfigured at run time, and virtual reconfiguration circuit (VRC), where the cell output port bypasses one of the input data or selects the functional unit output. Hence, HexCell combines the merits of DPR and VRC including resource-awareness, reconfiguration speed and routing flexibility. In addition, the cell structure supports pipelining and data synchronization for achieving high throughput for data-intensive applications like image processing. A HexCell is represented by a binary string (chromosome) that encodes the cell's function and the output selections. Our developed evolvable HexCell array supports more inputs and outputs, a variety of possible datapaths, and has faster reconfiguration, compared to the state-of-the-art systolic array while maintaining the same resource utilization. Moreover, by using the same genetic algorithm on the two systolic arrays, results show that the HexCell array has higher throughput and can evolve faster than state-of-the-art array.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131024262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Routing of nets is one of the most time-consuming steps in the FPGA design flow. Existing works have described ways of accelerating the process through parallelization. However, only some of them are deterministic, and determinism is often achieved at the cost of speedup. In this paper, we propose ParaDRo, a parallel FPGA router based on spatial partitioning that achieves deterministic results while maintaining reasonable speedup. Existing spatial-partitioning-based routers do not scale well because the number of nets available to fully utilize all processors shrinks as the number of processors increases. In addition, they route nets within a spatial partition sequentially. ParaDRo mitigates this problem by scheduling nets within a spatial partition to be routed in parallel if their bounding boxes do not overlap. Further parallelism is extracted by decomposing multi-sink nets into single-sink nets to minimize bounding-box overlaps and increase the number of nets that can be routed in parallel. These improvements enable ParaDRo to achieve an average speedup of 5.4X with 8 threads, with minimal impact on the quality of results.
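The bounding-box scheduling idea can be illustrated with a short sketch. The following Python snippet (an illustration, not ParaDRo's actual implementation) greedily groups nets whose bounding boxes are mutually disjoint into batches that could be routed concurrently.

```python
# Illustrative sketch: batch nets whose bounding boxes do not overlap so that
# each batch can be routed in parallel without sharing routing resources.
def overlaps(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return not (ax1 < bx0 or bx1 < ax0 or ay1 < by0 or by1 < ay0)

def schedule_nets(nets):
    """nets: list of (name, bounding_box); returns batches of mutually disjoint nets."""
    batches = []
    for name, box in nets:
        for batch in batches:
            if all(not overlaps(box, other_box) for _, other_box in batch):
                batch.append((name, box))
                break
        else:
            batches.append([(name, box)])
    return batches

nets = [("n1", (0, 0, 3, 3)), ("n2", (4, 4, 6, 6)), ("n3", (2, 2, 5, 5))]
print(schedule_nets(nets))  # [[n1, n2], [n3]] -> two parallel routing rounds
```

Decomposing multi-sink nets into single-sink nets shrinks the individual bounding boxes, which increases the chance that two nets end up in the same batch.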
{"title":"ParaDRo: A Parallel Deterministic Router Based on Spatial Partitioning and Scheduling","authors":"Chin Hau Hoo, Akash Kumar","doi":"10.1145/3174243.3174246","DOIUrl":"https://doi.org/10.1145/3174243.3174246","url":null,"abstract":"Routing of nets is one of the most time-consuming steps in the FPGA design flow. Existing works have described ways of accelerating the process through parallelization. However, only some of them are deterministic, and determinism is often achieved at the cost of speedup. In this paper, we propose ParaDRo, a parallel FPGA router based on spatial partitioning that achieves deterministic results while maintaining reasonable speedup. Existing spatial partitioning based routers do not scale well because the number of nets that can fully utilize all processors reduces as the number of processors increases. In addition, they route nets that are within a spatial partition sequentially. ParaDRo mitigates this problem by scheduling nets within a spatial partition to be routed in parallel if they do not have overlapping bounding boxes. Further parallelism is extracted by decomposing multi-sink nets into single-sink nets to minimize the amount of bounding box overlaps and increase the number of nets that can be routed in parallel. These improvements enable ParaDRo to achieve an average speedup of 5.4X with 8 threads with minimal impact on the quality of results.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124995582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-Level Synthesis (HLS) promises improved designer productivity, but requires a debug ecosystem that allows designers to debug in the context of the original source code. Recent work has presented in-system debug frameworks in which instrumentation added to the design collects trace data as the circuit runs and a software tool allows the user to replay the execution using the captured data. When searching for the root cause of a bug, the designer may need to modify the instrumentation to collect data from a new part of the design, requiring a lengthy recompile. In this paper, we propose a flexible debug overlay family that provides software-like debug turnaround times for HLS-generated circuits. At compile time, the overlay is added to the design and compiled. At debug time, the overlay can be configured many times to implement specific debug scenarios without recompilation. This paper first outlines a number of "capabilities" that such an overlay should have, and then describes architectural support for each of these capabilities. The cheapest overlay variant allows selective variable tracing with only a 1.7% increase in area overhead over the baseline debug instrumentation, while the deluxe variant offers a 2x-7x improvement in trace buffer memory utilization with conditional buffer freeze support.
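As an illustration of two of these capabilities, the following Python model sketches selective variable tracing and conditional buffer freeze as a behavioral simulation; the class, its parameters, and the trace format are assumptions, not the paper's hardware interface.

```python
# Behavioral sketch (assumptions, not the paper's RTL): a circular trace buffer
# that records only the variables selected at "debug time" and stops capturing
# once a configurable freeze condition fires.
from collections import deque

class TraceOverlay:
    def __init__(self, depth, traced_vars, freeze_cond=None):
        self.buffer = deque(maxlen=depth)      # circular trace memory
        self.traced_vars = set(traced_vars)    # selective variable tracing
        self.freeze_cond = freeze_cond         # conditional buffer freeze
        self.frozen = False

    def cycle(self, cycle_no, values):
        if self.frozen:
            return
        sample = {v: values[v] for v in self.traced_vars if v in values}
        self.buffer.append((cycle_no, sample))
        if self.freeze_cond and self.freeze_cond(values):
            self.frozen = True                 # keep the window around the bug

overlay = TraceOverlay(depth=4, traced_vars={"i", "acc"},
                       freeze_cond=lambda v: v.get("acc", 0) > 10)
for c in range(8):
    overlay.cycle(c, {"i": c, "acc": c * c, "tmp": -c})
print(list(overlay.buffer))   # last samples up to and including the trigger cycle
```

In the actual overlay, the variable selection and freeze condition would be written into configuration registers at debug time rather than recompiled into the bitstream.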
{"title":"Architecture Exploration for HLS-Oriented FPGA Debug Overlays","authors":"Al-Shahna Jamal, Jeffrey B. Goeders, S. Wilton","doi":"10.1145/3174243.3174254","DOIUrl":"https://doi.org/10.1145/3174243.3174254","url":null,"abstract":"High-Level Synthesis (HLS) promises improved designer productivity, but requires a debug ecosystem that allows designers to debug in the context of the original source code. Recent work has presented in-system debug frameworks where instrumentation added to the design collects trace data as the circuit runs, and a software tool that allows the user to replay the execution using the captured data. When searching for the root cause of a bug, the designer may need to modify the instrumentation to collect data from a new part of the design, requiring a lengthy recompile. In this paper, we propose a flexible debug overlay family that provides software-like debug turn-around times for HLS generated circuits. At compile time, the overlay is added to the design and compiled. At debug time, the overlay can be configured many times to implement specific debug scenarios without a recompilation. This paper first outlines a number of \"capabilities\" that such an overlay should have, and then describes architectural support for each of these capabilities. The cheapest overlay variant allows selective variable tracing with only a 1.7% increase in area overhead from the baseline debug instrumentation, while the deluxe variant offers 2x-7x improvement in trace buffer memory utilization with conditional buffer freeze support.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126290004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nonvolatile FPGAs (NV-FPGAs) have the potential to eliminate the standby power that is increasingly wasted in recent standard SRAM-based FPGAs. However, the functionality of conventional NV-FPGAs is not sufficient compared to that of standard SRAM-based FPGAs. For example, an effective circuit structure for performing the shift-register (SR) function has not yet been proposed. In this paper, a magnetic tunnel junction (MTJ) based nonvolatile lookup table (NV-LUT) circuit that can perform the SR function with low power consumption is proposed. The MTJ device is the best candidate in terms of virtually unlimited endurance, CMOS compatibility, and 3D stacking capability. On the other hand, the large power consumption of the SR function is a serious design issue for the MTJ-based NV-LUT circuit: because the write current of the MTJ device is large and a CMOS-oriented implementation must update all the stored data after each SR operation, high power consumption is unavoidable. To overcome this issue, in the proposed LUT circuit the address for read/write access is incremented at each cycle instead of directly shifting the data. In this way, the number of data updates per 1-bit shift is reduced to one, which results in a large power saving. Moreover, since the selector is shared between the read (logic) and write operations, its hardware cost is small. In fact, a 99% power reduction and a 52% reduction in transistor count are achieved compared to an SRAM-based LUT circuit. The authors would like to acknowledge ImPACT of CSTI, the CIES consortium program, JST-OPERA, and JSPS KAKENHI Grant No. 17H06093.
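The address-increment idea can be illustrated in software. The sketch below (an assumption-level behavioral model, not the MTJ circuit) keeps the stored bits in place and advances a head pointer, so each 1-bit shift costs exactly one write instead of one write per stored bit.

```python
# Simple model of the idea: instead of physically shifting every stored bit,
# keep the data in place and advance a head pointer, so each 1-bit shift
# performs exactly one data update.
class PointerShiftRegister:
    def __init__(self, depth):
        self.mem = [0] * depth   # nonvolatile storage cells
        self.head = 0            # address incremented on each shift

    def shift_in(self, bit):
        self.mem[self.head] = bit            # single data update per shift
        self.head = (self.head + 1) % len(self.mem)

    def read(self):
        n = len(self.mem)
        return [self.mem[(self.head + i) % n] for i in range(n)]  # oldest first

sr = PointerShiftRegister(4)
for b in (1, 0, 1, 1, 0):
    sr.shift_in(b)
print(sr.read())  # [0, 1, 1, 0]: one write per shift vs. 4 writes for a naive shift
```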
{"title":"Design of an MTJ-Based Nonvolatile LUT Circuit with a Data-Update Minimized Shift Operation for an Ultra-Low-Power FPGA: (Abstract Only)","authors":"D. Suzuki, T. Hanyu","doi":"10.1145/3174243.3174984","DOIUrl":"https://doi.org/10.1145/3174243.3174984","url":null,"abstract":"Nonvolatile FPGAs (NV-FPGAs) have a potential advantage to eliminate wasted standby power which is increasingly serious in recent standard SRAM-based FPGAs. However, functionality of the conventional NV-FPGAs are not sufficient compared to that of standard SRAM-based FPGAs. For example, an effective circuit structure to perform shift-register (SR) function has not been proposed yet. In this paper, a magnetic tunnel junction (MTJ) based nonvolatile lookup table (NV-LUT) circuit that can perform SR function with low power consumption is proposed. The MTJ device is the best candidate in terms of virtually unlimited endurance, CMOS compatibility, and 3D stacking capability. On the other hand, large power consumption to perform SR function a serious design issue for the MTJ-based NV-LUT circuit. Since the write current for the MTJ device is large and all the data must be updated after the SR operation using CMOS-oriented method, large power consumption is indispensable. To overcome this issue, the address for read/write access is incremented at each cycle instead of direct data shifting in the proposed LUT circuit. In this way, the number of data update per 1-bit shift is minimized to one, which results in great power saving. Moreover, since the selector is shared both read (logic) and write operation, its hardware cost is small. In fact, 99% of power reduction and 52% of transistor counts reduction compared to those of SRAM-based LUT circuit are performed. The authors would like to acknowledge ImPACT of CSTI, CIES consortium program, JST-OPERA, and JSPS KAKENHI Grant No. 17H06093.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133743775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Convolutional Neural Networks (CNNs) have gained great popularity. Intensive computation and a huge amount of external data access are two challenging factors for hardware acceleration. Beyond these, the ability to handle various CNN models is also a challenge. At present, most proposed FPGA-based CNN accelerators either can only handle specific CNN models or must be re-coded and re-downloaded to the FPGA for each different CNN model, which is a significant burden for developers. In this paper, we design a software-defined architecture that copes with different CNN models while keeping high throughput: the hardware can be programmed according to the requirements. Several techniques are proposed to optimize the performance of our accelerator. For the convolutional layers, we propose a software-defined data-reuse technique that ensures all parameters are loaded only once during the computing phase, which reduces the off-chip data access volume and, with it, the required memory capacity and bandwidth. By exploiting the sparsity of the input feature map, the loading of almost 80% of the weight parameters in the fully-connected (FC) layer can be skipped. Compared to previous works, our software-defined accelerator has the highest flexibility while keeping relatively high throughput. In addition, our accelerator has a lower off-chip data access volume, which has a large effect on power consumption.
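The FC-layer sparsity optimization can be sketched as follows. The snippet is only a software illustration under assumed shapes and sparsity; it shows why weights tied to zero-valued activations never need to be fetched.

```python
# Rough sketch of the sparsity idea (illustrative shapes and sparsity, assumed):
# in a fully-connected layer, weights tied to zero-valued input activations
# never contribute to the output, so their loads can be skipped entirely.
import numpy as np

def fc_skip_zero_inputs(weights, activations):
    """weights: (out, in) matrix; activations: (in,) vector, largely sparse."""
    nonzero = np.flatnonzero(activations)            # indices worth loading
    out = weights[:, nonzero] @ activations[nonzero] # only these columns fetched
    skipped = 1.0 - len(nonzero) / activations.size
    return out, skipped

rng = np.random.default_rng(0)
act = rng.random(1024) * (rng.random(1024) < 0.2)    # ~80% zeros after ReLU-like sparsity
w = rng.random((256, 1024))
out, skipped = fc_skip_zero_inputs(w, act)
print(f"skipped {skipped:.0%} of weight loads")      # roughly 80%
assert np.allclose(out, w @ act)                     # same result as the dense FC layer
```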
{"title":"Software-Defined FPGA-Based Accelerator for Deep Convolutional Neural Networks: (Abstract Only)","authors":"Yankang Du, Qinrang Liu, Shuai Wei, Chen Gao","doi":"10.1145/3174243.3174983","DOIUrl":"https://doi.org/10.1145/3174243.3174983","url":null,"abstract":"Now, Convolutional Neural Network (CNN) has gained great popularity. Intensive computation and huge external data access amount are two challenged factors for the hardware acceleration. Besides these, the ability to deal with various CNN models is also challenged. At present, most of the proposed FPGA-based CNN accelerator either can only deal with specific CNN models or should be re-coded and re-download on the FPGA for the different CNN models. This would bring great trouble for the developers. In this paper, we designed a software-defined architecture to cope with different CNN models while keeping high throughput. The hardware can be programmed according to the requirement. Several techniques are proposed to optimize the performance of our accelerators. For the convolutional layer, we proposed the software-defined data reuse technique to ensure that all the parameters can be only loaded once during the computing phase. This will reduce large off-chip data access amount and the need for the memory and the need for the memory bandwidth. By using the sparse property of the input feature map, almost 80% weight parameters can be skipped to be loaded in the full-connected (FC) layer. Compared to the previous works, our software-defined accelerator has the highest flexibility while keeping relative high throughout. Besides this, our accelerator also has lower off-chip data access amount which has a great effect on the power consumption.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115332601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes CausaLearn, the first automated framework that enables real-time and scalable approximation of the Probability Density Function (PDF) in the context of causal Bayesian graphical models. CausaLearn targets complex streaming scenarios in which the input data evolves over time and independence cannot be assumed between data samples (e.g., continuous time-varying data analysis). Our framework is devised using a HW/SW co-design approach. We provide the first FPGA implementation of Hamiltonian Markov Chain Monte Carlo that can efficiently sample from the steady-state probability distribution at scale while considering the correlation between the observed data. CausaLearn is customizable to the limits of the underlying resource provisioning in order to maximize the effective system throughput. It uses physical profiling to abstract high-level hardware characteristics. These characteristics are integrated into our automated customization unit in order to tile, schedule, and batch the PDF-approximation workload according to the pertinent platform resources and constraints. We benchmark the design performance for analyzing various massive time-series data on three FPGA platforms with different computational budgets. Our extensive evaluations demonstrate up to two orders-of-magnitude runtime and energy improvements compared to the best-known prior solution. We provide an accompanying API that can be leveraged by data scientists and practitioners to automate and abstract hardware design optimization.
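For reference, the computation such a sampler performs per step is the standard Hamiltonian Monte Carlo update shown below; the step size, trajectory length, and toy target distribution are examples, not CausaLearn's configuration.

```python
# Textbook Hamiltonian Monte Carlo step in NumPy, as a reference for what the
# hardware pipeline computes (leapfrog integration plus a Metropolis test).
import numpy as np

def hmc_step(theta, grad_log_p, log_p, step=0.1, n_leapfrog=20, rng=np.random):
    p = rng.standard_normal(theta.shape)          # sample auxiliary momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new += 0.5 * step * grad_log_p(theta_new)   # half-step for momentum
    for _ in range(n_leapfrog):
        theta_new += step * p_new                 # full position update
        p_new += step * grad_log_p(theta_new)
    p_new -= 0.5 * step * grad_log_p(theta_new)   # undo the extra half-step
    h_old = -log_p(theta) + 0.5 * p @ p           # Hamiltonian before
    h_new = -log_p(theta_new) + 0.5 * p_new @ p_new
    return theta_new if rng.random() < np.exp(h_old - h_new) else theta

# Example: sample a 2-D standard normal
log_p = lambda x: -0.5 * x @ x
grad_log_p = lambda x: -x
theta = np.zeros(2)
for _ in range(1000):
    theta = hmc_step(theta, grad_log_p, log_p)
```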
{"title":"CausaLearn: Automated Framework for Scalable Streaming-based Causal Bayesian Learning using FPGAs","authors":"B. Rouhani, M. Ghasemzadeh, F. Koushanfar","doi":"10.1145/3174243.3174259","DOIUrl":"https://doi.org/10.1145/3174243.3174259","url":null,"abstract":"This paper proposes CausaLearn, the first automated framework that enables real-time and scalable approximation of Probability Density Function (PDF) in the context of causal Bayesian graphical models. CausaLearn targets complex streaming scenarios in which the input data evolves over time and independence cannot be assumed between data samples (e.g., continuous time-varying data analysis). Our framework is devised using a HW/SW co-design approach. We provide the first implementation of Hamiltonian Markov Chain Monte Carlo on FPGA that can efficiently sample from the steady state probability distribution at scales while considering the correlation between the observed data. CausaLearn is customizable to the limits of the underlying resource provisioning in order to maximize the effective system throughput. It uses physical profiling to abstract high-level hardware characteristics. These characteristics are integrated into our automated customization unit in order to tile, schedule, and batch the PDF approximation workload corresponding to the pertinent platform resources and constraints. We benchmark the design performance for analyzing various massive time-series data on three FPGA platforms with different computational budgets. Our extensive evaluations demonstrate up to two orders-of-magnitude runtime and energy improvements compared to the best-known prior solution. We provide an accompanying API that can be leveraged by data scientists and practitioners to automate and abstract hardware design optimization.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120869578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrea Guerrieri, Sahand Kashani-Akhavan, Mikhail Asiatici, P. Lombardi, B. Belhadj, P. Ienne
Modern heterogeneous SoCs (Systems-on-Chip) contain a set of Hard IPs (HIPs) surrounded by an FPGA fabric for hosting custom Hardware Accelerators (HAs). However, efficiently managing such HAs in an embedded Linux environment involves creating and building custom device drivers specific to the target platform, which negatively impacts development cost, portability, and time-to-market. To address this issue, we present LEOSoC, an open-source cross-platform embedded Linux library. LEOSoC reduces the development effort required to interface HAs with applications and makes SoCs easy to use for an embedded software developer who is familiar with the semantics of standard POSIX threads. Using LEOSoC does not require any specific version of the Linux kernel, nor rebuilding a custom driver for each new kernel release. LEOSoC consists of a base hardware system and a software layer. Both hardware and software are portable across SoCs from various vendors, and the library recognizes and auto-adapts to the target SoC platform on which it is running. Furthermore, LEOSoC allows the application to partially or completely change the structure of the HAs at runtime without rebooting the system, by leveraging the underlying platform's support for dynamic full/partial FPGA reconfigurability. The system has been tested on multiple COTS (Commercial Off-The-Shelf) boards from different vendors, each running a different version of Linux, thereby demonstrating the real portability and usability of LEOSoC in a specific industrial design.
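The pthread-like programming model described above can be caricatured in software. The sketch below is purely hypothetical (the names are not LEOSoC's real API); it only illustrates what "launch an accelerator job like a thread, then join it" means to the application developer.

```python
# Hypothetical illustration only (names and calls are NOT LEOSoC's real API):
# a hardware-accelerator invocation exposed with thread-like start/join
# semantics, modeled here in pure software.
import threading

class AcceleratorJob:
    """Stand-in for an accelerator invocation that behaves like a thread."""
    def __init__(self, kernel, *args):
        self._result = None
        self._thread = threading.Thread(
            target=lambda: setattr(self, "_result", kernel(*args)))

    def start(self):            # analogous to pthread_create
        self._thread.start()

    def join(self):             # analogous to pthread_join
        self._thread.join()
        return self._result

job = AcceleratorJob(sum, range(1_000_000))   # offload a "kernel"
job.start()                                   # host code keeps running here
print(job.join())                             # 499999500000
```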
{"title":"LEOSoC: An Open-Source Cross-Platform Embedded Linux Library for Managing Hardware Accelerators in Heterogeneous System-on-Chips(Abstract Only)","authors":"Andrea Guerrieri, Sahand Kashani-Akhavan, Mikhail Asiatici, P. Lombardi, B. Belhadj, P. Ienne","doi":"10.1145/3174243.3175002","DOIUrl":"https://doi.org/10.1145/3174243.3175002","url":null,"abstract":"Modern heterogeneous SoCs (System-on-Chip) contain a set of Hard IPs (HIPs) surrounded by an FPGA fabric for hosting custom Hardware Accelerators (HAs). However, efficiently managing such HAs in an embedded Linux environment involves creating and building custom device drivers specific to the target platform, which negatively impacts development cost, portability and time-to-market. To address this issue, we present LEOSoC, an open-source cross-platform embedded Linux library. LEOSoC reduces the development effort required to interface HAs with applications and makes SoCs easy to use for an embedded software developer who is familiar with the semantics of standard POSIX threads. Using LEOSoC does not require any specific version of the Linux kernel, nor to rebuild a custom driver for each new kernel release. LEOSoC consists of a base hardware system and a software layer. Both hardware and software are portable across SoC from various vendors and the library recognizes and auto-adapts to the target SoC platform on which it is running. Furthermore, LEOSoC allows the application to partially or completely change the structure of the HAs at runtime without rebooting the system by leveraging the underlying platforms? support for dynamic full/partial FPGA reconfigurability. The system has been tested on multiple COTS (Commercial Off The Shelf) boards from different vendors, each one running different versions of Linux and, therefore, proving the real portability and usability of LEOSoC in a specific industrial design.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"470 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115870422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Skyline computation is a method for extracting interesting entries from a large population with multiple attributes. These entries, called skyline or Pareto-optimal entries, are known to have extreme characteristics that cannot be found using outlier-detection methods. Skyline computation is an important task for characterizing large amounts of data and selecting interesting entries with extreme features. When the population changes dynamically, the task of calculating a sequence of skyline sets is called continuous skyline computation. This task is known to be difficult for the following reasons: (1) information must be kept for non-skyline entries, since they may join the skyline in the future; (2) the appearance or disappearance of even a single entry can change the skyline drastically; and (3) it is difficult to adopt a geometric acceleration algorithm for skyline computation tasks with high-dimensional datasets. A new algorithm, called jointed rooted-tree (JR-tree), has been developed that manages entries using a rooted-tree structure. JR-tree delays extending the tree to deeper levels in order to accelerate tree construction and traversal. In this study, we propose a JR-tree-based acceleration algorithm for continuous skyline computation. Our hardware algorithm parallelizes the calculation of the dominance relations between a target entry and the skyline entries. We implemented the algorithm on an FPGA and showed that high-speed tree construction and traversal can be realized. Compared with an Intel CPU running state-of-the-art software algorithms, our FPGA-based implementation reduces the query processing time for synthetic and real-world datasets, running 1.7x to 35x faster than the software implementations.
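As a reference for the operation being parallelized, the following plain-Python snippet defines the dominance relation and uses it to filter a skyline; maximization on every attribute is assumed here, though the direction is application-specific.

```python
# Plain-Python reference for the dominance relation the accelerator parallelizes
# (maximization on every attribute is assumed).
def dominates(a, b):
    """a dominates b if a is >= b in every attribute and > in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(entries):
    result = []
    for e in entries:
        if not any(dominates(s, e) for s in result):         # e survives
            result = [s for s in result if not dominates(e, s)] + [e]
    return result

points = [(9, 1), (5, 5), (3, 8), (4, 4), (7, 1)]
print(skyline(points))  # [(9, 1), (5, 5), (3, 8)] -- (4, 4) and (7, 1) are dominated
```

The hardware evaluates many of these pairwise dominance checks concurrently, which is where the FPGA speedup comes from.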
{"title":"Continuous Skyline Computation Accelerator with Parallelizing Dominance Relation Calculations: (Abstract Only)","authors":"Kenichi Koizumi, K. Hiraki, M. Inaba","doi":"10.1145/3174243.3174961","DOIUrl":"https://doi.org/10.1145/3174243.3174961","url":null,"abstract":"Skyline Computation is a method for extracting interesting entries from a large population with multiple attributes. These entries, called skyline or Pareto optimal entries, are known to have extreme characteristics that cannot be found by using outlier detection methods. Skyline computation is an important task for characterizing large amounts of data and selecting interesting entries with extreme features. When the population changes dynamically, the task of calculating a sequence of skyline sets is called a continuous skyline computation. This task is known to be difficult for the following reasons: (1) information must be kept for non-skyline entries, since they may join the skyline in the future; (2) the appearance or disappearance of even a single entry can change the skyline drastically; and (3) it is difficult to adopt a geometric acceleration algorithm for skyline computation tasks with high-dimensional datasets. A new algorithm, called jointed rooted-tree (JR-tree), has been developed that manages entries using a rooted-tree structure. JR-tree delays extend the tree to deeper levels to accelerate tree construction and traversal. In this study, we propose the JR-tree based continuous skyline computation acceleration algorithm. Our hardware algorithm parallelizes the calculations of dominance relation between a target entry and the skyline entries. We implemented our hardware algorithm on an FPGA and showed that high-speed tree construction and traversal can be realized. Comparing our FPGA-based implementation with an Intel CPU running state-of-the-art software algorithms, it was found to reduce the query processing time for synthetic and real-world datasets. Our hardware implementation is 1.7x to 35x faster than the software implementations.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"385 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114899454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 2: CAD","authors":"Sabyasachi Das","doi":"10.1145/3252937","DOIUrl":"https://doi.org/10.1145/3252937","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"16 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122380805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jakub Cabal, Pavel Benácek, Lukás Kekely, Michal Kekely, V. Pus, J. Korenek
As the throughput of computer networks is on a constant rise, there is a need for ever-faster packet parsing modules at all points of the networking infrastructure. Parsing is a crucial operation that influences the final throughput of a network device. Moreover, this operation must precede any kind of further traffic processing such as filtering/classification, deep packet inspection, and so on. This paper presents a parser architecture that can currently scale up to terabit throughput in a single FPGA, while the overall processing speed is sustained even for the shortest frame lengths and for an arbitrary number of supported protocols. The architecture of our parser can also be automatically generated from a high-level description of a protocol stack in the P4 language, which makes the rapid deployment of new protocols considerably easier. The results presented in the paper confirm that our automatically generated parsers are capable of reaching an effective throughput of over 1 Tbps (or more than 2000 Mpps) on Xilinx UltraScale+ FPGAs and around 800 Gbps (or more than 1200 Mpps) on the previous-generation Virtex-7 FPGAs.
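To show what such a generated parser computes, the toy Python model below walks a minimal Ethernet -> IPv4 -> TCP/UDP parse graph; the protocol subset and packet layout are simplified assumptions, not the generated hardware.

```python
# Toy software model of the parse graph a P4 description defines (Ethernet ->
# IPv4 -> TCP/UDP only; field offsets follow the standard headers, everything
# else is simplified).
import struct

def parse_packet(frame: bytes):
    headers = {}
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    headers["ethernet"] = {"ethertype": ethertype}
    if ethertype != 0x0800:                       # not IPv4: stop parsing
        return headers
    ihl = (frame[14] & 0x0F) * 4                  # IPv4 header length in bytes
    proto = frame[14 + 9]
    headers["ipv4"] = {"protocol": proto, "ihl": ihl}
    l4 = 14 + ihl
    if proto in (6, 17):                          # TCP or UDP
        sport, dport = struct.unpack_from("!HH", frame, l4)
        headers["tcp" if proto == 6 else "udp"] = {"sport": sport, "dport": dport}
    return headers

# 14-byte Ethernet + minimal 20-byte IPv4 (proto=UDP) + 8-byte UDP header
frame = bytes(12) + b"\x08\x00" + b"\x45" + bytes(8) + b"\x11" + bytes(10) \
        + struct.pack("!HH", 53, 4444) + bytes(4)
print(parse_packet(frame))
```

A hardware parser generated from P4 unrolls this decision graph into a pipeline so that one frame (or more) can be resolved every clock cycle regardless of frame length.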
{"title":"Configurable FPGA Packet Parser for Terabit Networks with Guaranteed Wire-Speed Throughput","authors":"Jakub Cabal, Pavel Benácek, Lukás Kekely, Michal Kekely, V. Pus, J. Korenek","doi":"10.1145/3174243.3174250","DOIUrl":"https://doi.org/10.1145/3174243.3174250","url":null,"abstract":"As throughput of computer networks is on a constant rise, there is a need for ever-faster packet parsing modules at all points of the networking infrastructure. Parsing is a crucial operation which has an influence on the final throughput of a network device. Moreover, this operation must precede any kind of further traffic processing like filtering/classification, deep packet inspection, and so on. This paper presents a parser architecture which is capable to currently scale up to a terabit throughput in a single FPGA, while the overall processing speed is sustained even on the shortest frame lengths and for an arbitrary number of supported protocols. The architecture of our parser can be also automatically generated from a high-level description of a protocol stack in the P4 language which makes the rapid deployment of new protocols considerably easier. The results presented in the paper confirm that our automatically generated parsers are capable of reaching an effective throughput of over 1 Tbps (or more than 2000 Mpps) on the Xilinx UltraScale+ FPGAs and around 800 Gbps (or more than 1200 Mpps) on their previous generation Virtex-7 FPGAs.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115020717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}