
2013 International Conference on Field-Programmable Technology (FPT): Latest Papers

A case for hardened multiplexers in FPGAs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718328
S. Chin, J. Anderson
This paper presents a case for a hybrid configurable logic block that contains a mixture of LUTs and hardened multiplexers, with the goal of higher logic density and reduced area. Technology mapping optimizations targeting the proposed architecture, called MuxMap, are implemented in a modified version of the mapper in the ABC logic synthesis tool. VPR is used to model the new hybrid configurable logic block and to verify the post-place-and-route implementation. Multiple hybrid configurable logic block architectures with varying MUX:LUT ratios are evaluated across three benchmark suites with both the Quartus II and Odin-II front-end RTL synthesis tools. Experimentally, we show that the architecture saves ~4% area post place and route without any mapper optimizations, and ~6% with the MuxMap optimizations in ABC, while maintaining mapping depth, overall configurable logic block count, and routing demand.
Citations: 6
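To illustrate why the hardened multiplexers in the entry above can save area, the following minimal Python sketch maps a 4:1 multiplexer onto a generic LUT. The 6-input LUT size, the 4:1 multiplexer width, and the helper names `mux4`, `lut_config`, and `lut_eval` are assumptions chosen for illustration; the abstract does not state them. The sketch only shows that a six-input function consumes all 64 configuration bits of a 6-input LUT, whereas a dedicated multiplexer needs no truth table.

```python
from itertools import product


def mux4(d0, d1, d2, d3, s0, s1):
    """Reference 4:1 multiplexer: select bits (s1, s0) pick one data input."""
    return (d0, d1, d2, d3)[(s1 << 1) | s0]


def lut_config(func, k):
    """Build the truth table of a k-input function, one bit per input pattern."""
    return [func(*bits) for bits in product((0, 1), repeat=k)]


def lut_eval(config, bits):
    """Evaluate a LUT: the input bits (most significant first) index the table."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return config[index]


# Assumed 6-input LUT: four data inputs plus two select inputs fill it entirely.
config = lut_config(mux4, 6)
print(len(config))  # 64 configuration bits for a single 4:1 multiplexer
```

A mapper that recognises such multiplexer structures could, under these assumptions, steer them to hardened cells and keep the LUTs for irregular logic.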
Partially reconfigurable flux calculation scheme in advection term computation
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718393
M. S. A. Talip, Takayuki Akamine, Mao Hatto, Yasunori Osana, N. Fujita, H. Amano
Fast Aerodynamics Routines (FaSTAR) is one of the most recent fluid dynamics software packages. FaSTAR is hard to execute on parallel machines because of its irregular and unpredictable data structure. Exploiting the advantages of reconfigurable hardware to make up for the shortcomings of existing high-performance computers has gradually become a solution. However, a single FPGA is not enough for the FaSTAR package because the whole module is very large. Instead of using many FPGAs, this work explores the partial reconfiguration capability available in recent FPGAs. The advection term computation module in FaSTAR is chosen as the target subroutine. We propose a reconfigurable flux calculation scheme that uses partial reconfiguration to save hardware resources so that the design fits in a single FPGA. We developed a flux computation module and implemented five flux calculation schemes as reconfigurable modules. This implementation saves up to 62.75% of resources and speeds up configuration by 6.28 times. Performance evaluation also shows a 2.65-times speedup over an Intel Core 2 Duo at 2.4 GHz.
Citations: 2
Mobile GPU shader processor based on non-blocking Coarse Grained Reconfigurable Arrays architecture
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718353
Kwon-Taek Kwon, Sungjin Son, Jeongae Park, Jeongae Park, Sangoak Woo, Seokyoon Jung, Soojung Ryu
Processors based on coarse-grained reconfigurable arrays (CGRAs) provide high performance and energy efficiency as well as programmability, thanks to the ability to reconfigure the datapath connecting the ALU arrays. A CGRA-based processor executes loop kernels whose schedule must be fixed at compile time. This restriction makes CGRAs inefficient, particularly when accessing external memories or caches whose access time varies greatly. It is therefore challenging to build a CGRA-based high-performance, energy-efficient mobile GPU, because GPU shader execution usually involves massive numbers of texture memory accesses, which go to both the texture cache and external texture memory. In this paper, we present a Non-Blocking Coarse-Grained Reconfigurable Array (NBCGRA) architecture that can handle varying-latency operations efficiently. We also propose an improved CGRA-based GPU shader processor architecture built on it. A retry buffer enables threads to re-execute later, once the required memory access completes. With a non-blocking texture cache, the shader core can execute without stalls even on cache misses. All of these components help to improve CGRA core throughput greatly despite longer memory access latencies. Evaluation results show that our NBCGRA-based shader processor performs efficiently despite extreme variation in texture cache access latencies and reduces shader execution cycles by up to 68% with minimal hardware cost overhead.
Citations: 6
TROJANUS: An ultra-lightweight side-channel leakage generator for FPGAs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718347
Sebastian Kutzner, A. Poschmann, Marc Stöttinger
In this article we present a new side-channel building block for FPGAs which, akin to the ancient Roman god Janus, has two contradictory faces: as a watermarking tool, it allows IP cores to be uniquely identified by adding a single slice to the design; as a Trojan Side-Channel (TSC), it can potentially leak an entire encryption key within a single trace and without knowledge of either the plaintext or the ciphertext. We practically verify the feasibility of TROJANUS by embedding it as a TSC into a lightweight FPGA implementation of PRESENT. In addition, we investigate the leakage behavior of FPGAs in more detail and present a new pre-processing technique that can potentially increase the correlation coefficient of DPA attacks.
Citations: 12
Direct virtual memory access from FPGA for high-productivity heterogeneous computing
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718414
Ho-Cheung Ng, Yuk-Ming Choi, Hayden Kwok-Hay So
Heterogeneous computing utilizing both CPU and FPGA requires access to data in main memory from both devices. While a typical system relies on software executing on the CPU to orchestrate all data movement between the FPGA and main memory, our demo presents a complementary FPGA-centric approach that allows gateware to directly access the virtual memory space as part of the executing process, without involving the CPU. A caching address translation buffer was implemented alongside the user FPGA gateware to provide runtime mapping between virtual and physical memory addresses. The system was implemented on a commercial off-the-shelf FPGA add-on card to demonstrate the viability of such an approach in low-cost systems. Experiments demonstrated a reasonable performance improvement over a typical software-centric implementation, while the number of context switches between FPGA and CPU in both kernel and user mode was significantly reduced, freeing the CPU for other concurrent user tasks.
Citations: 15
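The caching address translation buffer mentioned in the entry above can be pictured with a small software model. This is a hedged sketch, not the authors' gateware: the 4 KiB page size, the least-recently-used replacement policy, the dictionary standing in for the operating system's page table, and the class name `TranslationBuffer` are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TranslationBuffer:
    """Small cache of virtual-page to physical-frame mappings (LRU, assumed)."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # stand-in for the OS page table
        self.capacity = capacity      # must be at least 1
        self.entries = OrderedDict()  # cached mappings, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Return the physical address for a virtual address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            # Slow path: fetch the mapping from the page table.
            self.entries[vpn] = self.page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[vpn] * PAGE_SIZE + offset


tb = TranslationBuffer({0: 7, 1: 3, 2: 9}, capacity=2)
print(hex(tb.translate(0x1004)))  # virtual page 1 maps to frame 3: 0x3004
```

In this model only a miss touches the page table, which mirrors the motivation for caching translations close to the accelerator.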
Real-time ray tracing on coarse-grained reconfigurable processor
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718352
J. D. Lee, Youngsam Shin, Won-Jong Lee, Soojung Ryu, Jeongwook Kim
Ray tracing is a 3D rendering method that generates an image by simulating the path of light. It can generate high-quality images, but it requires great computing power. Recent advances in ray tracing technology enable real-time ray tracing on modern desktop CPUs/GPUs. In the current mobile environment, however, this is difficult because mobile GPUs lack sufficient computing power, memory bandwidth, and flexibility. In this paper, we present a mobile ray tracing system using the Samsung Reconfigurable Processor (SRP). The SRP architecture includes a tightly coupled very long instruction word (VLIW) engine and a coarse-grained reconfigurable array (CGRA). The VLIW engine is designed for general-purpose computations, such as function invocation and branch selection, while the coarse-grained reconfigurable array is specialized for the data-intensive part of the program and can be configured dynamically. We propose an iterative batch-based ray tracing algorithm for the SRP and optimize memory bandwidth with local memory and a data cache. Our ray tracing system is implemented on a commercial FPGA-based prototyping system. The experimental results show that our system is suitable for mobile ray tracing.
Citations: 21
An acceleration method of short read mapping using FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718385
Y. Sogabe, T. Maruyama
The rapid development of Next Generation Sequencing (NGS) has made it possible to generate more than 100G base pairs per day from one machine. The data produced are randomly fragmented DNA base pair strings, called short reads, and millions of short reads are mapped onto reference genomes, which are complete genetic sequences, to reconstruct the sequence of the sample DNA. This short read mapping is becoming the bottleneck of NGS systems. In this paper, we propose an FPGA system for the mapping based on a hash-index method. In our system, short reads are divided into seeds, which are fixed-length substrings used for the mapping, and the seeds are sorted using buckets. Then, the seeds in each bucket are compared in parallel with their candidate locations. With this approach, many seeds can be compared with their candidate locations in a massively parallel manner, and processing speed improves because the number of random accesses to the DRAM banks that store the candidate locations is reduced. Furthermore, nucleotide substitutions in a seed can be allowed in this parallel comparison, which makes it possible to achieve higher matching rates than previous work.
Citations: 17
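The seed-and-bucket idea in the entry above can be sketched in software as follows. This is an illustrative simplification under stated assumptions, not the authors' FPGA design: the index is a plain Python dictionary, seeds are taken at non-overlapping offsets, substitutions are tolerated when the whole read is verified rather than inside the seed comparison itself, and all function names (`build_index`, `hamming`, `map_reads`) are hypothetical.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Hash index: every seed-length substring maps to its reference positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Number of substituted bases between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, reference, seed_len, max_mismatches=1):
    """Map reads by bucketing their seeds, then checking candidate locations."""
    index = build_index(reference, seed_len)
    # Group the seeds of all reads into buckets so that each distinct seed
    # needs a single index lookup (a stand-in for fewer random DRAM accesses).
    buckets = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(0, len(read) - seed_len + 1, seed_len):
            buckets[read[offset:offset + seed_len]].append((read_id, offset))
    results = {read_id: set() for read_id in range(len(reads))}
    for seed, members in buckets.items():
        candidates = index.get(seed, ())  # one lookup per bucket
        for read_id, offset in members:
            read = reads[read_id]
            for pos in candidates:
                start = pos - offset
                if start < 0 or start + len(read) > len(reference):
                    continue  # the read would run off the reference
                window = reference[start:start + len(read)]
                if hamming(read, window) <= max_mismatches:
                    results[read_id].add(start)
    return {read_id: sorted(hits) for read_id, hits in results.items()}


reference = "ACGTACGTTAGCCGATAGGCTA"
reads = ["TTAGCCGA", "TTAGCAGA"]  # the second read has one substitution
print(map_reads(reads, reference, seed_len=4))  # {0: [7], 1: [7]}
```

With `max_mismatches=0` the second read no longer maps, which shows in miniature how tolerating substitutions raises the matching rate.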
Accelerating validation of time-triggered automotive systems on FPGAs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718322
Shanker Shreejith, Suhaib A. Fahmy, M. Lukasiewycz
Automotive systems comprise a high number of networked safety-critical functions. Any design changes or addition of new functionality must be rigorously tested to ensure that no performance or safety issues are introduced, and this consumes a significant amount of time. Validation should be conducted using a faithful representation of the system, and so typically, a full subsystem is built for validation. We present a scalable scheme for emulating a complete cluster of automotive embedded compute units on an FPGA, with accelerated network communication using custom physical level interfaces. With these interfaces, we can achieve acceleration of system emulation by 8× or more, with a systematic way of exploring real-world issues like jitter, network delays, and data corruption, among others. By using the same communication infrastructure as in a real deployed system, this validation is closer to the requirements of standards compliance. This approach also enables hardware-in-the-loop (HIL) validation, allowing rapid prototyping of distributed functions, including changes in network topology and parameters, and modification of time-triggered schedules without physical hardware modification. We present an implementation of this framework on the Xilinx ML605 evaluation board that integrates six FlexRay automotive functions to demonstrate the potential of the framework.
Citations: 10
Debugging processors with advanced features by reprogramming LUTs on FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718329
Satoshi Jo, A. M. Gharehbaghi, Takeshi Matsumoto, M. Fujita
In this paper, we propose an automated method for debugging and rectifying logical bugs in processors implemented on FPGAs. Our method preserves the current circuit topology and rectifies bugs by changing only the contents of LUTs, without any modification to the wiring. As a result, correcting the bugs does not require re-synthesis, which can be very time-consuming for complex processors due to possible timing closure problems. Because the topology of the circuit is preserved, correcting the bugs does not affect the timing of the circuit. In the design phase, we may add extra LUTs, or extra inputs to LUTs, to the original circuit so that they can be used in the debugging and rectification phase. After a bug is found, we first try to identify the candidate signals as well as the changes required to correct their behavior. This is achieved by symbolic simulation and equivalence checking between an instruction-set architecture model of the processor and its erroneous model at the micro-architecture level. Then, we try to map the corrected functionality into the existing LUT topology. This is realized by a novel method that formulates the problem as a QBF (Quantified Boolean Formula) problem and, instead of using QBF solvers, solves it by applying ordinary SAT solvers repeatedly and incrementally, using ideas from the CEGAR (Counterexample-Guided Abstraction Refinement) paradigm. We show the effectiveness and efficiency of our method by correcting bugs in two complex out-of-order superscalar processors with a timing error recovery mechanism.
在本文中,我们提出了一种自动化的方法来调试和纠正在fpga上实现的处理器中的逻辑错误。我们的方法是基于保留当前电路拓扑结构,并通过仅更改lut的内容来调试和纠正错误,而不修改布线。因此,纠正错误不需要重新合成,由于可能存在计时闭包问题,这对于复杂的处理器来说非常耗时。由于电路的拓扑结构被保留,修正错误不会影响电路的时序。在设计阶段,我们可以在原电路中增加额外的lut或额外的输入,以便我们可以在调试和整流阶段使用它们。在发现错误之后,首先我们尝试识别候选信号以及纠正其行为所需的更改。这是通过在微体系结构层面上对处理器的指令集体系结构模型和错误模型进行符号模拟和等价检验来实现的。然后,我们尝试将校正后的功能映射到现有的LUT拓扑中。这是通过一种新的方法来实现的,该方法将问题表述为QBF(量化布尔公式)问题,并通过重复地增量应用常规SAT求解器来解决问题,而不是利用CEGAR(反例引导抽象细化)范式的思想来解决问题。通过采用定时错误恢复机制对两个复杂无序超标量处理器的错误进行校正,证明了该方法的有效性和高效性。
Citations: 9
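The CEGAR-style loop mentioned in the abstract of "Debugging processors with advanced features by reprogramming LUTs on FPGA" can be illustrated with a toy sketch. The problem has the shape "there exists a LUT configuration such that, for all inputs, the circuit matches the specification". The loop alternates between proposing a configuration that is correct on the inputs collected so far and searching for an input that refutes it. Everything below is an assumption made for illustration and not the authors' implementation: the three-LUT topology, the function names (`circuit`, `find_candidate`, `find_counterexample`, `rectify`), and above all the use of brute-force enumeration as a stand-in for the two SAT solver calls.

```python
from itertools import product

def lut(table, *inputs):
    """Evaluate a LUT: `table` holds one output bit per input combination."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return table[index]

def circuit(cfg, x):
    """Hypothetical fixed topology: two 2-input LUTs feed a third one.

    `cfg` is a 12-bit tuple (4 bits per LUT); the wiring never changes.
    """
    a, b, c = x
    t0, t1, t2 = cfg[0:4], cfg[4:8], cfg[8:12]
    return lut(t2, lut(t0, a, b), lut(t1, b, c))

def find_candidate(examples, spec):
    """Stand-in for a SAT call: a config that is correct on `examples`."""
    for cfg in product((0, 1), repeat=12):
        if all(circuit(cfg, x) == spec(x) for x in examples):
            return cfg
    return None

def find_counterexample(cfg, spec):
    """Stand-in for a SAT call: an input where `cfg` disagrees with `spec`."""
    for x in product((0, 1), repeat=3):
        if circuit(cfg, x) != spec(x):
            return x
    return None

def rectify(spec):
    """Return LUT contents realizing `spec`, or None if the topology cannot."""
    examples = []  # counterexamples collected so far (the refinement set)
    while True:
        cfg = find_candidate(examples, spec)
        if cfg is None:
            return None  # no LUT contents fit this topology
        cex = find_counterexample(cfg, spec)
        if cex is None:
            return cfg   # correct on every input
        examples.append(cex)

# Corrected behaviour that fits the topology: (a AND b) XOR (b OR c).
fixed = rectify(lambda x: (x[0] & x[1]) ^ (x[1] | x[2]))
```

Each counterexample is new, because the candidate already agrees with the specification on every collected input, so the loop terminates after at most as many rounds as there are input combinations. In a real flow the two helper functions would be queries to an incremental SAT solver rather than enumeration.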
A defect-tolerant cluster in a mesh SRAM-based FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718407
Arwa Ben Dhia, S. Rehman, Adrien Blanchardon, L. Naviner, M. Benabdenbi, R. Chotin-Avot, Emna Amouri, H. Mehrez, Z. Marrakchi
In this paper, we propose the implementation of multiple defect-tolerant techniques on an SRAM-based FPGA. These techniques include redundancy in both the logic block and the intra-cluster interconnect. In the logic block, redundancy is implemented at the multiplexer level. Its efficiency is analyzed by injecting a single defect at the output of a multiplexer, considering all possible locations and input combinations. At the interconnect level, fine-grain redundancy is introduced, which not only bypasses defects but also increases routability. Taking advantage of the sparse intra-cluster interconnect structures, routability is further improved by an efficient distribution of feedback paths, allowing more flexibility in the connections among logic blocks. Emulation results show significant improvements of about 15% and 34% in the robustness of the logic block and the intra-cluster interconnect, respectively. Furthermore, the impact of these hardening schemes on the testability of the FPGA cluster for manufacturing defects is investigated in terms of maximum achievable fault coverage and the respective cost.
Citations: 4
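The single-defect analysis described in the abstract of "A defect-tolerant cluster in a mesh SRAM-based FPGA" (one defect at a multiplexer output, all locations, all input combinations) can be illustrated with a small exhaustive sketch. The assumptions here are not taken from the paper: the logic block is reduced to a 2-input LUT built from three 2:1 multiplexers, the defect is modelled as a stuck-at-0 or stuck-at-1 value, no redundancy is added, and the names `mux2`, `lut2`, and `masked_fraction` are hypothetical. The sketch only shows how such a robustness figure could be counted for an unhardened baseline.

```python
from itertools import product

def mux2(sel, d0, d1):
    """2:1 multiplexer."""
    return d1 if sel else d0

def lut2(cfg, a, b, fault=None):
    """2-input LUT as a tree of three 2:1 muxes.

    `fault` is None or a pair (mux_index, stuck_value) that forces the
    output of that mux to a constant (assumed stuck-at defect model).
    """
    def out(index, value):
        if fault is not None and fault[0] == index:
            return fault[1]
        return value

    m0 = out(0, mux2(b, cfg[0], cfg[1]))  # first-level mux, used when a == 0
    m1 = out(1, mux2(b, cfg[2], cfg[3]))  # first-level mux, used when a == 1
    return out(2, mux2(a, m0, m1))        # output mux

def masked_fraction():
    """Count (defect, configuration, input) cases where the output is still correct."""
    total = masked = 0
    for mux_index, stuck in product(range(3), (0, 1)):   # every defect location
        for cfg in product((0, 1), repeat=4):            # every LUT content
            for a, b in product((0, 1), repeat=2):       # every input combination
                total += 1
                if lut2(cfg, a, b, (mux_index, stuck)) == lut2(cfg, a, b):
                    masked += 1
    return masked, total

masked, total = masked_fraction()
robustness = masked / total  # share of single-defect cases that stay correct
```

A hardened design would be evaluated by rerunning the same enumeration on the redundant multiplexer structure and comparing the two fractions.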