2013 International Conference on Field-Programmable Technology (FPT): Latest Papers

A prototyping system for hardware distributed objects with diversity of programming languages: design and preliminary evaluation
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718418
Takeshi Ohkawa, T. Yokota, K. Ootsu
A prototyping system for hardware distributed objects, built around a hardwired ORB (Object Request Broker) protocol processing engine, was implemented on the Xilinx Zynq-7000 platform. With it, a circuit IP on an FPGA can be operated from application software on a Linux/ARM processor through an object-oriented method call. The proposed framework increases controllability and design productivity in FPGA-based systems. A developer can define an object-oriented interface for a circuit IP in an FPGA and implement the control sequence part using the JavaRock Java-to-HDL synthesizer. Because the system conforms to the standard CORBA (Common Object Request Broker Architecture) protocol, circuit IPs in an FPGA can be handled through an object-oriented interface from a variety of programming languages, such as C++, Java, and Python. The round-trip delay of the prototype system was measured over a Xillybus FIFO interface channel.
Citations: 1
Partially reconfigurable flux calculation scheme in advection term computation
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718393
M. S. A. Talip, Takayuki Akamine, Mao Hatto, Yasunori Osana, N. Fujita, H. Amano
Fast Aerodynamics Routines (FaSTAR) is one of the most recent fluid dynamics software packages. FaSTAR is hard to execute on parallel machines because of its irregular and unpredictable data structure. Exploiting the advantages of reconfigurable hardware to make up for the inadequacy of existing high-performance computers has gradually become a solution. However, a single FPGA is not enough for the FaSTAR package because the whole module is very large. Instead of using many FPGAs, the partially reconfigurable hardware available in recent FPGAs is explored for this application. The advection term computation module in FaSTAR is chosen as the target subroutine. We propose a reconfigurable flux calculation scheme that uses partial reconfiguration to save hardware resources so that the design fits in a single FPGA. We developed a flux computation module, and five flux calculation schemes are implemented as reconfigurable modules. This implementation saves up to 62.75% of resources and speeds up configuration by 6.28 times. Performance evaluation also shows a 2.65 times acceleration compared to an Intel Core 2 Duo at 2.4 GHz.
Citations: 2
TROJANUS: An ultra-lightweight side-channel leakage generator for FPGAs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718347
Sebastian Kutzner, A. Poschmann, Marc Stöttinger
In this article we present a new side-channel building block for FPGAs, which, akin to the ancient Roman god Janus, has two contradictory faces: as a watermarking tool, it allows IP cores to be uniquely identified by adding a single slice to the design; as a Trojan Side-Channel (TSC), it can potentially leak an entire encryption key within only one trace and without knowledge of either the plaintext or the ciphertext. We practically verify the feasibility of TROJANUS by embedding it as a TSC into a lightweight FPGA implementation of PRESENT. In addition, we investigate the leakage behavior of FPGAs in more detail and present a new pre-processing technique, which can potentially increase the correlation coefficient of DPA attacks.
Citations: 12
Mobile GPU shader processor based on non-blocking Coarse Grained Reconfigurable Arrays architecture
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718353
Kwon-Taek Kwon, Sungjin Son, Jeongae Park, Jeongae Park, Sangoak Woo, Seokyoon Jung, Soojung Ryu
Processors based on coarse-grained reconfigurable arrays (CGRAs) provide high performance and energy efficiency as well as programmability, thanks to the ability to reconfigure the datapath connecting the ALU arrays. A CGRA-based processor executes loop kernels whose schedule must be fixed at compile time. This restriction makes a CGRA inefficient when accessing external memories or caches whose access time varies greatly. It is therefore challenging to build a CGRA-based high-performance, energy-efficient mobile GPU, because GPU shader execution usually involves massive numbers of texture memory accesses, which consist of accesses to the texture cache and to external texture memory. In this paper, we present a Non-blocking Coarse Grained Reconfigurable Arrays (NBCGRA) architecture that can handle varying-latency operations efficiently. We also propose an improved CGRA-based GPU shader processor architecture built on it. A retry buffer enables threads to re-execute later, when the required memory access completes. With a non-blocking texture cache, the shader core can execute without stalls even in the case of cache misses. All of these components help to improve CGRA core throughput greatly despite longer memory access latencies. Evaluation results show that our NBCGRA-based shader processor performs efficiently despite extreme variation in texture cache access latencies and reduces shader execution cycles by up to 68% with minimal hardware cost overhead.
Citations: 6
Direct virtual memory access from FPGA for high-productivity heterogeneous computing
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718414
Ho-Cheung Ng, Yuk-Ming Choi, Hayden Kwok-Hay So
Heterogeneous computing utilizing both CPU and FPGA requires access to data in the main memory from both devices. While a typical system relies on software executing on the CPU to orchestrate all data movements between the FPGA and the main memory, our demo presents a complementary FPGA-centric approach that allows gateware to directly access the virtual memory space as part of the executing process without involving the CPU. A caching address translation buffer was implemented alongside the user FPGA gateware to provide runtime mapping between virtual and physical memory addresses. The system was implemented on a commercial off-the-shelf FPGA add-on card to demonstrate the viability of such an approach in low-cost systems. Experiments demonstrated a reasonable performance improvement compared to a typical software-centric implementation. In addition, the number of context switches between FPGA and CPU in both kernel and user mode was significantly reduced, freeing the CPU for other concurrent user tasks.
Citations: 15
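The caching address translation buffer mentioned in the abstract above can be pictured with a small software model. The sketch below is only an illustration under stated assumptions: the 4 KiB page size, the LRU replacement policy, and the names `TranslationBuffer`, `page_table`, and `translate` are hypothetical and are not taken from the paper, whose buffer is implemented in FPGA gateware.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes (not stated in the abstract)


class TranslationBuffer:
    """Tiny LRU cache of virtual-page -> physical-frame mappings (hypothetical model)."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # stand-in for the OS page table
        self.capacity = capacity       # number of cached translations
        self.entries = OrderedDict()   # virtual page number -> frame number
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Return the physical address for a virtual address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)          # mark as most recently used
        else:
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]  # slow path: page-table lookup
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used entry
        return self.entries[vpn] * PAGE_SIZE + offset


if __name__ == "__main__":
    tlb = TranslationBuffer({0: 7, 1: 3, 2: 9}, capacity=2)
    for vaddr in (0x0010, 0x0020, 0x1004, 0x2008, 0x0010):
        print(hex(vaddr), "->", hex(tlb.translate(vaddr)))
    print("hits:", tlb.hits, "misses:", tlb.misses)
```

Repeated accesses to the same page are served from the cache, while a miss falls back to the slower page-table lookup; in the real system that fallback is the step that would otherwise involve the CPU.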
Real-time ray tracing on coarse-grained reconfigurable processor
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718352
J. D. Lee, Youngsam Shin, Won-Jong Lee, Soojung Ryu, Jeongwook Kim
Ray tracing is a 3D rendering method that generates an image by simulating the path of light. It can generate high-quality images, but it requires great computing power. Recent advances in ray tracing technology enable real-time ray tracing on modern desktop CPUs and GPUs. In the current mobile environment, however, it remains difficult because mobile GPUs lack sufficient computing power, memory bandwidth, and flexibility. In this paper, we present a mobile ray tracing system using the Samsung Reconfigurable Processor (SRP). The SRP architecture includes a tightly coupled very long instruction word (VLIW) engine and a coarse-grained reconfigurable array (CGRA). The VLIW engine is designed for general-purpose computations, such as function invocation and branch selection, while the coarse-grained reconfigurable array is specialized for the data-intensive part of the program and can be configured dynamically. We propose an iterative batch-based ray tracing algorithm for the SRP and optimize memory bandwidth with local memory and a data cache. Our ray tracing system is implemented on a commercial FPGA-based prototyping system. The experimental results show that our system is suitable for mobile ray tracing.
Citations: 21
An acceleration method of short read mapping using FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718385
Y. Sogabe, T. Maruyama
The rapid development of Next Generation Sequencing (NGS) has made it possible to generate more than 100G base pairs per day from one machine. The produced data are randomly fragmented DNA base pair strings, called short reads, and millions of short reads are mapped onto reference genomes, which are complete genetic sequences, to reconstruct the sequence of the sample DNA. This short read mapping is becoming the bottleneck of NGS systems. In this paper, we propose an FPGA system for the mapping based on a hash-index method. In our system, short reads are divided into seeds, which are fixed-length substrings used for the mapping, and the seeds are sorted using buckets. Then, the seeds in each bucket are compared in parallel with the candidate locations. With this approach, many seeds can be compared with their candidate locations in a massively parallel manner, and the processing speed can be improved by reducing the number of random accesses to the DRAM banks that store the candidate locations. Furthermore, substitutions of the nucleotides in a seed can be allowed in this parallel comparison. This makes it possible to achieve higher matching rates than in previous work.
Citations: 17
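The seed-and-hash-index idea in the abstract above can be sketched in software. This is a simplified sequential model, not the paper's FPGA design: it leaves out the bucket sorting and the parallel DRAM-bank comparison, and it tolerates substitutions when verifying the whole read rather than inside each seed. The names `build_index`, `map_read`, and `max_mismatches` are hypothetical.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Map every seed_len-long substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def map_read(read, reference, index, seed_len, max_mismatches=1):
    """Return sorted reference positions where the read aligns with few substitutions."""
    hits = set()
    # Split the read into non-overlapping fixed-length seeds.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, []):
            start = pos - offset  # implied start of the whole read on the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    index = build_index(reference, seed_len=4)
    print(map_read("ACGTTAGC", reference, index, seed_len=4))  # exact match
    print(map_read("ACGTTAGG", reference, index, seed_len=4))  # one substitution
```

Each seed lookup yields candidate locations cheaply, and only those candidates are compared against the full read; the hardware design described above performs that comparison for many seeds at once.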
A defect-tolerant cluster in a mesh SRAM-based FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718407
Arwa Ben Dhia, S. Rehman, Adrien Blanchardon, L. Naviner, M. Benabdenbi, R. Chotin-Avot, Emna Amouri, H. Mehrez, Z. Marrakchi
In this paper, we propose the implementation of multiple defect-tolerant techniques on an SRAM-based FPGA. These techniques include redundancy at both the logic block and the intra-cluster interconnect. In the logic block, redundancy is implemented at the multiplexer level. Its efficiency is analyzed by injecting a single defect at the output of a multiplexer, considering all possible locations and input combinations. At the interconnect level, fine-grained redundancy is introduced, which not only bypasses defects but also increases routability. Taking advantage of the sparse intra-cluster interconnect structures, routability is further improved by an efficient distribution of feedback paths, allowing more flexibility in the connections among logic blocks. Emulation results show a significant improvement of about 15% and 34% in the robustness of the logic block and the intra-cluster interconnect, respectively. Furthermore, the impact of these hardening schemes on the testability of the FPGA cluster for manufacturing defects is also investigated in terms of the maximum achievable fault coverage and the respective cost.
Citations: 4
Debugging processors with advanced features by reprogramming LUTs on FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718329
Satoshi Jo, A. M. Gharehbaghi, Takeshi Matsumoto, M. Fujita
In this paper, we propose an automated method for debugging and rectifying logical bugs in processors that are implemented on FPGAs. Our method preserves the current circuit topology and debugs and rectifies bugs by changing only the contents of LUTs, without any modification to the wiring. As a result, correcting the bugs does not require re-synthesis, which can be very time consuming for complex processors due to possible timing closure problems. Because the topology of the circuit is preserved, correcting the bugs does not affect the timing of the circuit. In the design phase, we may add additional LUTs or additional inputs to LUTs in the original circuit, so that we can use them in the debugging and rectification phase. After a bug is found, we first try to identify the candidate signals as well as the changes required to correct their behavior. This is achieved by using symbolic simulation and equivalence checking between an instruction-set architecture model of the processor and its erroneous model at the micro-architecture level. Then, we try to map the corrected functionality onto the existing LUT topology. This is realized by a novel method that formulates the problem as a QBF (Quantified Boolean Formula) problem and solves it by repeatedly and incrementally applying normal SAT solvers instead of QBF solvers, using ideas from the CEGAR (Counter Example Guided Abstraction Refinement) paradigm. We show the effectiveness and efficiency of our method by correcting bugs in two complex out-of-order superscalar processors with a timing error recovery mechanism.
Citations: 9
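The CEGAR-style loop mentioned in the "Debugging processors with advanced features by reprogramming LUTs on FPGA" abstract can be illustrated with a toy sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the two-LUT topology, the `spec` function, and all function names are hypothetical, and exhaustive enumeration stands in for the SAT solver calls the authors describe. The sketch only shows the general shape of the idea: propose LUT contents consistent with the counterexamples seen so far, check them against the specification, and add any failing input as a new counterexample.

```python
from itertools import product

def lut(table, i0, i1):
    """Evaluate a 2-input LUT; `table` holds its 4 output bits."""
    return table[(i0 << 1) | i1]

def circuit(config, x):
    """Hypothetical fixed topology: out = LUT_B(LUT_A(a, b), c).
    Only the 8 LUT content bits in `config` may change, never the wiring."""
    a, b, c = x
    lut_a, lut_b = config[:4], config[4:]
    return lut(lut_b, lut(lut_a, a, b), c)

def spec(x):
    """Assumed correct behaviour the rectified circuit must match."""
    a, b, c = x
    return (a & b) ^ c

def find_candidate(examples):
    """Stand-in for a SAT call: LUT contents correct on all examples so far."""
    for config in product((0, 1), repeat=8):
        if all(circuit(config, x) == spec(x) for x in examples):
            return config
    return None

def find_counterexample(config):
    """Stand-in for a SAT call: an input on which `config` disagrees with spec."""
    for x in product((0, 1), repeat=3):
        if circuit(config, x) != spec(x):
            return x
    return None

def cegar_rectify():
    """Refine the candidate until no counterexample remains (or none exists)."""
    examples = []
    while True:
        config = find_candidate(examples)
        if config is None:
            return None, examples      # topology cannot realise the spec
        cex = find_counterexample(config)
        if cex is None:
            return config, examples    # correct for all inputs
        examples.append(cex)           # refine with the new counterexample

if __name__ == "__main__":
    config, examples = cegar_rectify()
    print("LUT contents:", config)
    print("counterexamples used:", examples)
```

Each counterexample is an input the current candidate gets wrong, so it cannot repeat an earlier one, and the loop terminates after at most as many iterations as there are input patterns.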
High-level synthesis of dynamic data structures: A case study using Vivado HLS
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718388
F. Winterstein, Samuel Bayliss, G. Constantinides
High-level synthesis promises to shorten the FPGA design cycle significantly compared with design entry in register transfer level (RTL) languages. Recent evaluations report that C-to-RTL flows can produce results close in quality to hand-crafted designs [1]. However, algorithms that use dynamic, pointer-based data structures, which are common in software, remain difficult to implement well. In this paper, we describe a comparative case study using Xilinx Vivado HLS as an exemplary state-of-the-art high-level synthesis tool. Our test cases are two alternative algorithms for the same compute-intensive machine learning technique (clustering) with significantly different computational properties. We compare a data-flow centric implementation with a recursive tree traversal implementation that incorporates complex data-dependent control flow and uses pointer-linked data structures and dynamic memory allocation. The outcome of this case study is twofold. For the first test case, we confirm similar performance between the hand-written and automatically generated RTL designs. The second case reveals a latency degradation by a factor greater than 30× if the source code is not altered prior to high-level synthesis. We identify the reasons for this shortcoming and present code transformations that narrow the performance gap to a factor of four. We generalise our source-to-source transformations, whose automation motivates future research directions for improving high-level synthesis of dynamic data structures.
Citations: 113
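The abstract of "High-level synthesis of dynamic data structures" does not say which code transformations the authors apply, so the following is only a guess at the kind of rewrite involved, not their method. It assumes two commonly used changes for hardware-oriented code: replacing recursion with a bounded explicit stack, and replacing pointer-linked, heap-allocated nodes with indices into a fixed-capacity pool. Python is used purely for illustration (real HLS input would be C or C++), and all names (`Node`, `TreePool`, `sum_recursive`, `sum_iterative`) are hypothetical.

```python
NIL = -1  # index value that plays the role of a null pointer

class Node:
    """Software-style node: linked by references, allocated dynamically."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def sum_recursive(node):
    """Recursive traversal over pointer-linked nodes (the 'before' form)."""
    if node is None:
        return 0
    return node.value + sum_recursive(node.left) + sum_recursive(node.right)

class TreePool:
    """Fixed-capacity node storage: array indices replace pointers."""
    def __init__(self, capacity):
        self.value = [0] * capacity
        self.left = [NIL] * capacity
        self.right = [NIL] * capacity
        self.next_free = 0

    def alloc(self, value, left=NIL, right=NIL):
        """Bump allocator standing in for dynamic memory allocation."""
        if self.next_free >= len(self.value):
            raise MemoryError("pool exhausted")
        idx = self.next_free
        self.next_free += 1
        self.value[idx], self.left[idx], self.right[idx] = value, left, right
        return idx

def sum_iterative(pool, root):
    """Same traversal with an explicit, statically bounded stack (the 'after' form)."""
    stack = [NIL] * len(pool.value)  # each node is pushed at most once
    top = 0
    total = 0
    if root != NIL:
        stack[top] = root
        top += 1
    while top > 0:
        top -= 1
        idx = stack[top]
        total += pool.value[idx]
        for child in (pool.left[idx], pool.right[idx]):
            if child != NIL:
                stack[top] = child
                top += 1
    return total

if __name__ == "__main__":
    pool = TreePool(8)
    root = pool.alloc(1, pool.alloc(2), pool.alloc(3))
    print(sum_iterative(pool, root), sum_recursive(Node(1, Node(2), Node(3))))
```

Both functions visit every node once; the second form makes the memory footprint and stack depth explicit and bounded, which is the property such rewrites are typically after.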