Loop pipelining is widely adopted as a key optimization method in high-level synthesis (HLS). However, when complex memory dependencies appear in a loop, commercial HLS tools are still not able to maximize pipeline performance. In this paper, we leverage parametric polyhedral analysis to reason about memory dependence patterns that are uncertain (i.e., parameterised by an undetermined variable) and/or non-uniform (i.e., varying between loop iterations). We develop an automated source-to-source code transformation to split the loop into pieces, which are then synthesised by Vivado HLS as the hardware generation back-end. Our technique allows the generated loops to run with a minimal initiation interval, automatically inserting statically determined parametric pipeline breaks at those iterations that violate dependencies. Our experiments on seven representative benchmarks show that, compared to default loop pipelining, our parametric loop splitting improves pipeline performance by 4.3× in terms of clock cycles per iteration. The optimized pipelines consume 2.0× as many LUTs, 1.8× as many registers, and 1.1× as many DSP blocks. Hence the area-time product is improved by nearly a factor of 2.
{"title":"Loop Splitting for Efficient Pipelining in High-Level Synthesis","authors":"Junyi Liu, John Wickerson, G. Constantinides","doi":"10.1109/FCCM.2016.27","DOIUrl":"https://doi.org/10.1109/FCCM.2016.27","url":null,"abstract":"Loop pipelining is widely adopted as a key optimization method in high-level synthesis (HLS). However, when complex memory dependencies appear in a loop, commercial HLS tools are still not able to maximize pipeline performance. In this paper, we leverage parametric polyhedral analysis to reason about memory dependence patterns that are uncertain (i.e., parameterised by an undetermined variable) and/or non-uniform (i.e., varying between loop iterations). We develop an automated source-to-source code transformation to split the loop into pieces, which are then synthesised by Vivado HLS as the hardware generation back-end. Our technique allows generated loops to run with a minimal interval, automatically inserting statically-determined parametric pipeline breaks at those iterations violating dependencies. Our experiments on seven representative benchmarks show that, compared to default loop pipelining, our parametric loop splitting improves pipeline performance by 4.3× in terms of clock cycles per iteration. The optimized pipelines consume 2.0× as many LUTs, 1.8× as many registers, and 1.1× as many DSP blocks. Hence the area-time product is improved by nearly a factor of 2.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124055909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract Data Types (ADTs) such as dictionaries and lists are essential for many embedded computing applications such as network stacks. However, in heterogeneous systems, code using ADTs can usually only run on CPUs, because components written in HLS do not support dynamic data structures. HLS tools cannot be used to synthesise dynamic data structures directly because the use of pointers is very restricted: for example, pointers to pointers and pointer casting are not supported. Consequently, it is unclear what the API should look like and how to express dynamic data structures in HLS so that the tools can compile them. We propose SynADT, which consists of a methodology and a benchmark. The methodology provides classic data structures (linked lists, binary trees, hash tables and vectors) using relative addresses instead of pointers in Vivado HLS. The benchmark can be used to evaluate the performance of data structures in HLS, on ARM processors and on soft processors such as MicroBlaze; the CPUs can utilise either the default C memory allocator or a hardware memory allocator. We evaluate the data structures on a Zynq FPGA, demonstrating scaling to approximately 10MB of memory usage and 1M data items. With a workload that uses 10MB of memory, the HLS data structures operating at 150MHz are on average 1.35× faster than MicroBlaze data structures operating at 150MHz with the default C allocator and 7.97× slower than ARM processor data structures operating at 667MHz with the default C allocator.
{"title":"SynADT: Dynamic Data Structures in High Level Synthesis","authors":"Zeping Xue, David B. Thomas","doi":"10.1109/FCCM.2016.26","DOIUrl":"https://doi.org/10.1109/FCCM.2016.26","url":null,"abstract":"Abstract Data Types (ADTs) such as dictionaries and lists are essential for many embedded computing applications such as network stacks. However, in heterogeneous systems, code using ADTs can usually only run in CPUs, because components written in HLS do not support dynamic data structures. HLS tools cannot be used to synthesise dynamic data structures directly because the use of pointers is very restricted, such as not supporting pointers to pointers or pointer casting. Consequently, it is unclear what the API should look like and how to express dynamic data structures in HLS so that the tools can compile them. We propose SynADT, which consists of a methodology and a benchmark. The methodology provides classic data structures (linked lists, binary trees, hash tables and vectors) using relativeaddresses instead of pointers in Vivado HLS. The benchmark can be used to evaluate the performance of data structures in HLS, ARM processors and soft processors such as MicroBlaze, CPUs can utilise either the default C memory allocator or a hardware memory allocator. We evaluate the data structures in a Zynq FPGA demonstrating scaling to approximately 10MB memory usage and 1M data items. With a workload that utilises 10MB memory, the HLS data structures operating at 150MHz are on average 1.35× faster than MicroBlaze data structures operating at 150MHz with the default C allocator and 7.97× slower than ARM processor data structures operating at 667MHz with the default C allocator.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130055434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved quality of results from high-level synthesis (HLS) tools has led to their increased adoption in hardware design. However, functional verification of HLS-produced designs remains a major challenge. Once a bug is exposed, designers must backtrace thousands of signals and simulation cycles to determine the underlying cause. The challenge is further exacerbated by HLS-produced, non-human-readable RTL. In this paper, we present AutoSLIDE, an automated cross-layer verification framework that instruments critical operations, detects discrepancies between software and hardware execution, and traces the suspect datapath tree to identify the source of each detected discrepancy. AutoSLIDE also maintains mappings between RTL datapath operations, LLVM-IR operations, and C/C++ source code to precisely pinpoint the root cause of bugs to the exact line/operation in source code, substantially reducing user effort to localize bugs. We demonstrate the effectiveness of AutoSLIDE by detecting and localizing bugs from earlier versions of the CHStone benchmark suite. Furthermore, we demonstrate its efficiency, with low overhead in HLS time (27%) and software trace gathering (10%), and significantly reduced trace size and simulation time compared to exhaustive instrumentation.
{"title":"AutoSLIDE: Automatic Source-Level Instrumentation and Debugging for HLS","authors":"Liwei Yang, S. Gurumani, Deming Chen, K. Rupnow","doi":"10.1109/FCCM.2016.38","DOIUrl":"https://doi.org/10.1109/FCCM.2016.38","url":null,"abstract":"Improved quality of results from high level synthesis (HLS) tools have led to their increased adoption in hardware design. However, functional verification of HLS-produced designs remains a major challenge. Once a bug is exposed, designers must backtrace thousands of signals and simulation cycles to determine the underlying cause. The challenge is further exacerbated with HLS-produced non-human-readable RTL. In this paper, we present AutoSLIDE, an automated cross-layer verification framework that instruments critical operations, detects discrepancies between software and hardware execution, and traces the suspect datapath tree to identify bug source for the detected discrepancy. AutoSLIDE also maintains mappings between RTL datapath operations, LLVM-IR operations, and C/C++ source code to precisely pinpoint the root-cause of bugs to the exact line/operation in source code, substantially reducing user effort to localize bugs. We demonstrate the effectiveness by detecting and localizing bugs from former versions of the CHStone benchmark suite. Furthermore, we demonstrate the efficiency of AutoSLIDE, with low overhead in HLS time (27%), software trace gathering (10%), and significantly reduced trace size and simulation time compared to exhaustive instrumentation.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121088980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Convolutional Neural Networks (ConvNets) are a powerful Deep Learning model, providing state-of-the-art accuracy on many emerging classification problems. However, ConvNet classification is a computationally heavy task whose complexity scales rapidly. This paper presents fpgaConvNet, a novel domain-specific modelling framework together with an automated design methodology for the mapping of ConvNets onto reconfigurable FPGA-based platforms. By interpreting ConvNet classification as a streaming application, the proposed framework employs the Synchronous Dataflow (SDF) model of computation as its basis and proposes a set of transformations on the SDF graph that explore the performance-resource design space, while taking into account platform-specific resource constraints. A comparison with existing ConvNet FPGA works shows that the proposed fully-automated methodology yields hardware designs that improve the performance density by up to 1.62× and reach up to 90.75% of the raw performance of architectures that are hand-tuned for particular ConvNets.
{"title":"fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs","authors":"Stylianos I. Venieris, C. Bouganis","doi":"10.1109/FCCM.2016.22","DOIUrl":"https://doi.org/10.1109/FCCM.2016.22","url":null,"abstract":"Convolutional Neural Networks (ConvNets) are a powerful Deep Learning model, providing state-of-the-art accuracy to many emerging classification problems. However, ConvNet classification is a computationally heavy task, suffering from rapid complexity scaling. This paper presents fpgaConvNet, a novel domain-specific modelling framework together with an automated design methodology for the mapping of ConvNets onto reconfigurable FPGA-based platforms. By interpreting ConvNet classification as a streaming application, the proposed framework employs the Synchronous Dataflow (SDF) model of computation as its basis and proposes a set of transformations on the SDF graph that explore the performance-resource design space, while taking into account platform-specific resource constraints. A comparison with existing ConvNet FPGA works shows that the proposed fully-automated methodology yields hardware designs that improve the performance density by up to 1.62× and reach up to 90.75% of the raw performance of architectures that are hand-tuned for particular ConvNets.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"65 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127390895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GRVI is an FPGA-efficient RISC-V RV32I soft processor. Phalanx is a parallel processor and accelerator array framework. Groups of processors and accelerators form shared-memory clusters. Clusters are interconnected with each other and with extreme-bandwidth I/O and memory devices by a Hoplite NoC with 300-bit links. An example Kintex UltraScale 040 system has 400 RISC-V cores, a peak throughput of 100,000 MIPS, a peak shared-memory bandwidth of 600 GB/s, a NoC bisection bandwidth of 700 Gb/s, and uses 12-17 W.
{"title":"GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator","authors":"J. Gray","doi":"10.1109/FCCM.2016.12","DOIUrl":"https://doi.org/10.1109/FCCM.2016.12","url":null,"abstract":"GRVI is an FPGA-efficient RISC-V RV32I soft processor. Phalanx is a parallel processor and accelerator array framework. Groups of processors and accelerators form shared memory clusters. Clusters are interconnected with each other and with extreme bandwidth I/O and memory devices by a Hoplite NOC with 300-bit links. An example Kintex UltraScale 040 system has 400 RISC-V cores, peak throughput of 100,000 MIPS, peak shared memory bandwidth of 600 GB/s, NOC bisection bandwidth of 700 Gb/s, and uses 12-17 W.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131688910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Field-Programmable Gate Arrays (FPGAs) provide ideal platforms for meeting the computational requirements of future space-based processing systems. However, FPGAs are susceptible to radiation-induced Single Event Upsets (SEUs). Techniques for dynamically reconfiguring corrupted modules of Triple Modular Redundant (TMR) components are well known. However, most of these techniques utilize resources that are themselves susceptible to SEUs to transfer reconfiguration requests from the TMR voters to a central reconfiguration controller. This paper evaluates the impact of these Reconfiguration Control Networks (RCNs) on the system's reliability and performance. We provide an overview of RCNs reported in the literature and compare them in terms of dependability, scalability and performance. We implemented our designs on a Xilinx Artix-7 FPGA to assess the resulting resource utilization and performance as well as to evaluate their soft error vulnerability using analytical techniques. We show that of the RCN topologies studied, an ICAP-based approach is the most reliable despite having the highest network latency. We also conclude that a module-based recovery approach is less reliable than scrubbing unless the RCN is triplicated and repaired when it suffers configuration memory errors.
{"title":"Reconfiguration Control Networks for TMR Systems with Module-Based Recovery","authors":"D. Agiakatsikas, N. T. H. Nguyen, Zhuoran Zhao, Tong Wu, E. Çetin, O. Diessel, Lingkan Gong","doi":"10.1109/FCCM.2016.30","DOIUrl":"https://doi.org/10.1109/FCCM.2016.30","url":null,"abstract":"Field-Programmable Gate Arrays (FPGAs) provide ideal platforms for meeting the computational requirements of future space-based processing systems. However, FPGAs are susceptible to radiation-induced Single Event Upsets (SEUs). Techniques for dynamically reconfiguring corrupted modules of Triple Modular Redundant (TMR) components are well known. However, most of these techniques utilize resources that are themselves susceptible to SEUs to transfer reconfiguration requests from the TMR voters to a central reconfiguration controller. This paper evaluates the impact of these Reconfiguration Control Networks (RCNs) on the system's reliability and performance. We provide an overview of RCNs reported in the literature and compare them in terms of dependability, scalability and performance. We implemented our designs on a Xilinx Artix-7 FPGA to assess the resulting resource utilization and performance as well as to evaluate their soft error vulnerability using analytical techniques. We show that of the RCN topologies studied, an ICAP-based approach is the most reliable despite having the highest network latency. We also conclude that a module-based recovery approach is less reliable than scrubbing unless the RCN is triplicated and repaired when it suffers configuration memory errors.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131652043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Stratix 10 project started with aggressive performance, size, and feature goals, all to be met on a lean schedule. Meeting these performance goals led to a restructuring of the entire configurable clock system into a regular gridded network, which subdivided the device into a composable system of "sectors". Sectors aligned with the needs of the project schedule, since they allowed complexity -- of specification, design, and validation -- to be addressed through "divide and conquer". Similarly, the customary "out-of-band" FPGA management functions including initialization, configuration, test, redundancy, scrubbing, and so on, were reconstituted to run on a collection of per-sector and supervisory processors interconnected by a NoC, whose distributed software would replace centralized tightly coupled finite state machines. This softwarization and parallelization reduced risk, increased flexibility, and increased data bandwidth. During development, parallel teams separately exercised each sector type and its local processor software via the sector's clock and NoC ports, accelerating validation on design databases two orders of magnitude smaller compared to previous methodologies. Even complex features can be added by including new NoC packet types and software rather than painfully adding wires to a rigid floor-plan.
{"title":"Sectors: Divide & Conquer and Softwarization in the Design and Validation of the Stratix® 10 FPGA","authors":"D. How, Sean Atsatt","doi":"10.1109/FCCM.2016.37","DOIUrl":"https://doi.org/10.1109/FCCM.2016.37","url":null,"abstract":"The Stratix 10 project started with aggressive performance, size, and feature goals, all to be met on a lean schedule. Meeting these performance goals led to a restructuring of the entire configurable clock system into a regular gridded network, which subdivided the device into a composable system of \"sectors\". Sectors aligned with the needs of the project schedule, since they allowed complexity -- of specification, design, and validation -- to be addressed through \"divide and conquer\". Similarly, the customary \"out-of-band\" FPGA management functions including initialization, configuration, test, redundancy, scrubbing, and so on, were reconstituted to run on a collection of per-sector and supervisory processors interconnected by a NoC, whose distributed software would replace centralized tightly coupled finite state machines. This softwarization and parallelization reduced risk, increased flexibility, and increased data bandwidth. During development, parallel teams separately exercised each sector type and its local processor software via the sector's clock and NoC ports, accelerating validation on design databases two orders of magnitude smaller compared to previous methodologies. Even complex features can be added by including new NoC packet types and software rather than painfully adding wires to a rigid floor-plan.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125328292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data analytics applications, such as graph databases, exhibit irregular behaviors that make their acceleration non-trivial. These applications expose a significant amount of Task Level Parallelism (TLP), but they present fine-grained memory accesses.
{"title":"A Dynamically Scheduled Architecture for the Synthesis of Graph Database Queries","authors":"Marco Minutoli, Vito Giovanni Castellana, Antonino Tumeo, Fabrizio Ferrandi, M. Lattuada","doi":"10.1109/FCCM.2016.41","DOIUrl":"https://doi.org/10.1109/FCCM.2016.41","url":null,"abstract":"Data analytics applications, such as graph databases, exibit irregular behaviors that make their acceleration non-trivial. These applications expose a significant amount of Task Level Parallelism (TLP), but they present fine grained memory accesses.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"11 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116812224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present preliminary results on hardware support for collective communication that takes advantage of a priori routing information.
{"title":"Application-Aware Collective Communication (Extended Abstract)","authors":"Jiayi Sheng, Qingqing Xiong, Chen Yang, M. Herbordt","doi":"10.1109/FCCM.2016.55","DOIUrl":"https://doi.org/10.1109/FCCM.2016.55","url":null,"abstract":"Preliminary results are presented of hardware support for collective communication that takes advantage of a priori routing information.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127040376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advances in genomic sequencing technology are causing genomic database growth to outpace Moore's Law. This continues to make genomic database search a difficult problem and a popular target for emerging processing technologies. The de facto software tool for genomic database search is NCBI BLAST, which operates by transforming each database query into a filter that is subsequently applied to the database. This requires a database scan for every query, fundamentally limiting its performance by I/O bandwidth. In this paper we present a functionally equivalent variation on the NCBI BLAST algorithm that maps more suitably to an FPGA implementation. This variation of the algorithm attempts to reduce the I/O requirement by leveraging FPGA-specific capabilities, such as high pattern-matching throughput and explicit on-chip memory structure and allocation. Our algorithm transforms the database -- not the query -- into a filter that is stored as a hierarchical arrangement of three tables, the first two of which are stored on chip and the third off chip. Our results show that -- while performance is data dependent -- it is possible to achieve speedups of up to 8X based on the relative reduction in I/O of our approach versus that of NCBI BLAST. More importantly, the performance relative to NCBI BLAST improves with larger databases and query workload sizes.
{"title":"Two-Hit Filter Synthesis for Genomic Database Search","authors":"Jordan A. Bradshaw, Rasha Karakchi, J. Bakos","doi":"10.1109/FCCM.2016.24","DOIUrl":"https://doi.org/10.1109/FCCM.2016.24","url":null,"abstract":"Advancements in genomic sequencing technology is causing genomic database growth to outpace Moore's Law. This continues to make genomic database search a difficult problem and a popular target for emerging processing technologies. The de facto software tool for genomic database search is NCBI BLAST, which operates by transforming each database query into a filter that is subsequently applied to the database. This requires a database scan for every query, fundamentally limiting its performance by I/O bandwidth. In this paper we present a functionally-equivalent variation on the NCBI BLAST algorithm that maps more suitably to an FPGA implementation. This variation of the algorithm attempts to reduce the I/O requirement by leveraging FPGA-specific capabilities, such as high pattern matching throughput and explicit on chip memory structure and allocation. Our algorithm transforms the database -- not the query -- into a filter that is stored as a hierarchical arrangement of three tables, the first two of which are stored on chip and the third off chip. Our results show that -- while performance is data dependent -- it is possible to achieve speedups of up to 8X based on the relative reduction in I/O of our approach versus that of NCBI BLAST. More importantly, the performance relative to NCBI BLAST improves with larger databases and query workload sizes.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"2020 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114466007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}