Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082763
Kan Shi, D. Boland, G. Constantinides
Online arithmetic has been widely studied for ASIC implementation. Online components were originally designed to perform computations in digit-serial fashion with the most significant digit (MSD) first, resulting in the ability to chain arithmetic operators together for low latency. More recently, research has shown that digit-parallel online operators can fail more gracefully than conventional arithmetic operators when operating beyond the deterministic clocking region. Unfortunately, the use of online arithmetic operators has historically required a large area overhead for FPGA implementation. In this paper, we propose novel approaches to implement the key primitives of online arithmetic, adders and multipliers, efficiently on modern Xilinx FPGAs with 6-input LUTs and carry resources. We demonstrate experimentally that, in comparison to direct RTL synthesis, the proposed architectures achieve slice savings of over 67% and 69%, and speed-ups of over 1.2x and 1.5x, for adders and multipliers respectively. As a result, the area overheads of using online adders and multipliers in place of traditional arithmetic primitives are reduced from 8.41x and 8.11x to 1.88x and 1.84x, respectively. Finally, because an online multiplier generates MSDs first, we also demonstrate a method to create an online multiplier with a reduced-precision output that is smaller than a traditional multiplier producing the same result. We show that this can lead to silicon area savings of up to 56%.
{"title":"Efficient FPGA implementation of digit parallel online arithmetic operators","authors":"Kan Shi, D. Boland, G. Constantinides","doi":"10.1109/FPT.2014.7082763","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082763","url":null,"abstract":"Online arithmetic has been widely studied for ASIC implementation. Online components were originally designed to perform computations in digit serial with most significant digit (MSD) first, resulting in the ability to chain arithmetic operators together for low latency. More recently, research has shown that digit parallel online operators can fail more gracefully when operating beyond the deterministic clocking region in comparison to operators with conventional arithmetic. Unfortunately, the utilization of online arithmetic operators in the past has required a large area overhead for FPGA implementation. In this paper, we propose novel approaches to implement the key primitives of online arithmetic, adders and multipliers, efficiently on modern Xilinx FPGAs with 6-input LUTs and carry resources. We demonstrate experimentally that in comparison to a direct RTL synthesis, the proposed architectures achieve slice savings of over 67% and 69%, and speed-ups of over 1.2x and 1.5x for adders and multipliers, respectively. As a result, the area overheads of using online adders and multipliers in place of traditional arithmetic primitives is reduced from 8.41 x and 8.11 x to 1.88x and 1.84x respectively. Finally, because an online multiplier generates MSDs first, we also demonstrate the method to create an online multiplier with a reduced precision output that is smaller than a traditional multiplier producing the same result. We show that this can lead to silicon area savings of up to 56%.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"10 1","pages":"115-122"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79937125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082813
Bin Tang, Yaping Lin, Jiliang Zhang
Physical unclonable function (PUF) is a promising hardware security primitive that can be applied to various security-related areas. The ring oscillator (RO) PUF is one of the most popular PUFs; it generates a volatile key by comparing the frequencies of ROs. Previous RO PUFs incur unacceptable hardware overheads to improve reliability and eliminate the effect of environmental factors. In this paper, we propose a frequency offset algorithm (FOA) to enhance reliability and lower the hardware overhead. The key idea is to make the frequency difference larger than a given threshold by offsetting the frequencies of RO pairs. Experimental results show that our proposed FOA method achieves better reliability and lower hardware overhead than the temperature-aware cooperative (TAC) approach. In particular, our proposed method achieves 100% utilization of the ROs.
{"title":"Improving the reliability of RO PUF using frequency offset","authors":"Bin Tang, Yaping Lin, Jiliang Zhang","doi":"10.1109/FPT.2014.7082813","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082813","url":null,"abstract":"Physical unclonable function (PUF) is a promising hardware security primitive that can be applied to various security related areas. The ring oscillator (RO) PUF is one of the most popular PUFs that can generate the volatile key by comparing the frequency between ROs. Previous RO PUFs incur unacceptable hardware overheads to improve the reliability in order to eliminate the effect of environment factors. In this paper, we propose a frequency offset algorithm (FOA) to enhance the reliability and low the hardware overhead. The key idea is to make the frequency difference larger than a given threshold by offsetting the frequencies of RO pairs. Experimental results show that our proposed FOA method has the better reliability and lower hardware overhead than the temperature-aware cooperative (TAC). Especially, our proposed method can achieve the 100% utilization of ROs.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"60 1","pages":"338-341"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81084593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082800
Jiasen Huang, Junyan Ren, Wenbo Yin, Lingli Wang
Sparse Matrix-Vector Multiplication (SpMxV) algorithms suffer heavy performance penalties due to irregular memory accesses. In this paper, we introduce a novel compressed element storage (CES) format, in which the additional indexing data structures are abandoned and each non-zero element of the matrix is instead represented as a named variable multiplied by the corresponding element of the vector. To ensure the fastest possible access and parallel access without data hazards, on-chip registers are used exclusively, in place of BRAM or off-chip DRAM/SRAM, to hold all the SpMxV data. On-chip DSP resources are fully utilized so that the maximum number of multipliers work concurrently.
{"title":"No zero padded sparse matrix-vector multiplication on FPGAs","authors":"Jiasen Huang, Junyan Ren, Wenbo Yin, Lingli Wang","doi":"10.1109/FPT.2014.7082800","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082800","url":null,"abstract":"Sparse Matrix-Vector Multiplication (SpMxV) algorithms suffer heavy performance penalties due to irregular memory accesses. In this paper, we introduce a novel compressed element storage (CES) format, in which the additional data structures for indexing are abandoned, and each location associated with the non-zero element of the matrix is now indicated by the name of a variable multiplied by the corresponding element of the vector. To ensure fastest access and parallel access without data hazards, on-chip registers are used exclusively to replace the BRAM or off-chip DRAM/SRAM to hold all the SpMxV data. On-chip DSP resources are fully utilized so as to ensure a maximum number of multipliers concurrently working.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"28 4 1","pages":"290-291"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78570045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082778
K. Tatsumura, Masato Oda, S. Yasuda
Multi-context configuration memory stores multiple sets of configuration data and changes the entire configuration of an FPGA quickly, enabling improved hardware utilization in dynamic reconfiguration architectures. The memory area for one set of configuration data should be much smaller than the computational resource it controls. In this paper, we propose a pure-CMOS, nonvolatile, and small-footprint multi-context configuration memory. The multi-context memory comprises multiple 2Tr nonvolatile memory elements, which are programmed by channel hot-electron injection, and allows context switching in a single clock cycle. A primitive dynamically reconfigurable device containing a lookup table and minimal interconnect, backed by a 16-bit 8-context configuration memory, was fabricated in a 0.18 um CMOS process and its functionality was demonstrated. The 2Tr nonvolatile memory element is more than 4 times denser than 6Tr SRAM, enabling greater logic density. The pure-CMOS and nonvolatile features would enhance the attractiveness of the technology in many applications.
{"title":"A pure-CMOS nonvolatile multi-context configuration memory for dynamically reconfigurable FPGAs","authors":"K. Tatsumura, Masato Oda, S. Yasuda","doi":"10.1109/FPT.2014.7082778","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082778","url":null,"abstract":"Multi-context configuration memory stores multiple sets of configuration data and changes the entire configuration of FPGA quickly, enabling enhancement of hardware utilization with dynamic reconfiguration architectures. The memory area for one set of configuration data should be much smaller than the computational resource it controls. In this paper, we propose a pure-CMOS, nonvolatile, and small-footprint multi-context configuration memory. The multi-context memory includes multiple 2Tr nonvolatile memory elements, which are programmed by channel hot-electron injection, and allows context switching in a single clock cycle. A primitive dynamically reconfigurable device having a lookup table and minimum interconnect backed by 16-bit 8-context configuration memory was fabricated by a 0.18 um CMOS process and its functionality was demonstrated. The 2Tr nonvolatile memory element is more than 4 times denser than 6Tr SRAM, enabling achievement of greater logic density. The pure-CMOS and nonvolatile features would enhance the attractiveness of the technology in many applications.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"41 1","pages":"215-222"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88066363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082825
A. Kojima
Blokus Duo is an abstract strategy game for two players. In this paper, we describe our FPGA implementation of a Blokus Duo player for the ICFPT2014 design contest, a revised version of our previous design for the ICFPT2013 design contest. Our design consists of a hardware logic part and a software part running on a soft IP processor. The hardware logic part calculates the evaluation value of the board state, which is a heavy task for the software part. Our implementation performs recursive alpha-beta pruning and iterative deepening in the software part, since these algorithms are complex to implement as hardware logic. The current version of our implementation on a Xilinx Artix-7 runs at 142 MHz. The hardware logic part evaluates about 90,000 nodes per second at the beginning of the game.
{"title":"FPGA implementation of Blokus Duo player using hardware/software co-design","authors":"A. Kojima","doi":"10.1109/FPT.2014.7082825","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082825","url":null,"abstract":"Blokus Duo is an abstract strategy game for two players. In this paper, we describe our FPGA implementation of Blokus Duo player for ICFPT2014 design contest, which is the revised version of the previous design for ICPFT2013 design contest. Our design consists of hardware logic part and software part using soft IP processor. The hardware logic part calculates evaluation value of the board status which is a heavy task for the software part. Our implementation uses recursive Alpha-Beta pruning and iteration deepening algorithm by the software part, which are complex to implement as the hardware logic circuit. The current version of our implementation on Xilinx Artix7 can run at 142MHz. The hardware logic part evaluates about 90,000 nodes in one second at the beginning of the game.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"378-381"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84034239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082774
Albert Kwon, Kaiyu Zhang, P. L. Lim, Yuchen Pan, Jonathan M. Smith, A. DeHon
RotoRouter addresses Denial-of-Service (DoS) attacks on networks with a novel protocol and router implementation. Sets of RotoRouters cooperate in detecting and filtering out invalid network traffic before it reaches network endpoints; a new router-enforceable connection protocol queries destination endpoints to authorize traffic flows and uses per-packet digital signatures to distinguish allowed from disallowed connections. A RotoRouter prototype was implemented on a four-port 1000BASE-T NetFPGA-10G platform and supports 1024 simultaneous active connections using 74 BRAMs (less than one quarter of the available NetFPGA-10G BRAMs). It sustains 800 Mbps per-port throughput for 1500B packets with less than 0.3 μs latency, even during a DoS attack. With additional logic and memory resources, the required validation and switching operations scale to port speeds in excess of 10 Gbps and links with more than 10,000 active flows.
{"title":"RotoRouter: Router support for endpoint-authorized decentralized traffic filtering to prevent DoS attacks","authors":"Albert Kwon, Kaiyu Zhang, P. L. Lim, Yuchen Pan, Jonathan M. Smith, A. DeHon","doi":"10.1109/FPT.2014.7082774","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082774","url":null,"abstract":"RotoRouter addresses Denial-of-Service (DoS) attacks on networks with a novel protocol and router implementation. Sets of RotoRouters cooperate in detecting and filtering out invalid network traffic before it reaches network endpoints; a new router-enforceable connection protocol queries destination endpoints to authorize traffic flows and uses per-packet digital signatures to distinguish allowed from disallowed connections. A RotoRouter prototype was implemented on a four-port 1000BASE-T NetFPGA-10G platform and supports 1024 simultaneous active connections using 74 BRAMs (less than one quarter of the available NetFPGA-10G BRAMs). It is able to sustain 800 Mbps per port throughputs for 1500B packets with less than 0.3/its latency, even during a DoS attack. With additional logic and memory resources, the required validation and switching operations scale to port speeds in excess of 10 Gbps and links with more than 10,000 active flows.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"39 1","pages":"183-190"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84325900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082758
Shaoyi Cheng, J. Wawrzynek
As high-level synthesis (HLS) moves towards mainstream adoption among FPGA designers, it has proven to be an effective method for rapid hardware generation. However, in the context of offloading compute-intensive software kernels to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. In this paper, we present an automatic flow to refactor and restructure processor-centric software implementations, making them better suited for FPGA platforms. The methodology generates pipelines that decouple memory operations and data access from computation. The resulting pipelines have much better throughput due to their efficient use of memory bandwidth and improved tolerance to data-access latency. The methodology complements existing work in high-level synthesis, easing the creation of heterogeneous systems with high-performance accelerators and general-purpose processors. With this approach, for a set of non-regular algorithm kernels written in C, a performance improvement of 3.3x to 9.1x is observed over direct C-to-hardware mapping using a state-of-the-art HLS tool.
{"title":"Architectural synthesis of computational pipelines with decoupled memory access","authors":"Shaoyi Cheng, J. Wawrzynek","doi":"10.1109/FPT.2014.7082758","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082758","url":null,"abstract":"As high level synthesis (HLS) moves towards mainstream adoption among FPGA designers, it has proven to be an effective method for rapid hardware generation. However, in the context of offloading compute intensive software kernels to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. In this paper, we present an automatic flow to refactor and restructure processor-centric software implementations, making them better suited for FPGA platforms. The methodology generates pipelines that decouple memory operations and data access from computation. The resulting pipelines have much better throughput due to their efficient use of the memory bandwidth and improved tolerance to data access latency. The methodology complements existing work in high-level synthesis, easing the creation of heterogeneous systems with high performance accelerators and general purpose processors. With this approach, for a set of non-regular algorithm kernels written in C, a performance improvement of 3.3 to 9.1x is observed over direct C-to-Hardware mapping using a state-of-the-art HLS tool.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"18 1","pages":"83-90"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85643289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-01-01 | DOI: 10.1109/fpt.2014.7082827
Sukjin Kim, Jason Wong, P. Kane, Dylan Wang, Xiaolong Xie
Xilinx has developed even more advanced FPGAs, 2nd-generation SoCs, and 3D ICs to stay a generation ahead and deliver an extra node's worth of performance, power, and integration. The UltraScale architecture was developed to scale from 20nm planar through 16nm and beyond FinFET (FF) technologies, and from monolithic through 3D ICs. In this talk, we will present case studies of Xilinx FPGAs in cutting-edge applications, along with the advantages of the UltraScale architecture, 2nd-generation SoCs, and design tools. IoT and Wearable Applications Enabled by Bluetooth Low Energy (BLE) Solutions, Patrick Kane, Cypress. Abstract: The Internet of Things is happening right now. The newest standard is Bluetooth Low Energy, or BLE. This may or may not be the long-term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard.
{"title":"Industrial session","authors":"Sukjin Kim, Jason Wong, P. Kane, Dylan Wang, Xiaolong Xie","doi":"10.1109/fpt.2014.7082827","DOIUrl":"https://doi.org/10.1109/fpt.2014.7082827","url":null,"abstract":"Xilinx has developed even more advanced FPGAs and 2nd generation SoCs and 3D ICs to stay a generation ahead, and deliver an extra node worth of performance, power, and integration. The UltraScale architecture was developed to scale from 20nm planar through 16nm and beyond FinFET (FF) technologies, and from monolithic through 3D ICs. In this talk, we will study the cases about Xilinx FPGA in cutting edge applications, also the advantages of UltraScale architecture 2nd generation SoCs, and design tools. IoT and Wearable Applications Enabled by Bluetooth Low Energy (BLE) Solutions Patrick Kane, Cypress Abstract: The Internet of things is happening right now. The newest standard is Bluetooth Low Energy or BLE. This may or may not be the long term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard. Industrial Session The Internet of things is happening right now. The newest standard is Bluetooth Low Energy or BLE. This may or may not be the long term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard. Industrial Session","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"42 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83589874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-01-01 | DOI: 10.1109/FPT.2014.7082757
Benjamin Carrión Schäfer
{"title":"Time sharing of Runtime Coarse-Grain Reconfigurable Architectures processing elements in multi-process systems","authors":"Benjamin Carrión Schäfer","doi":"10.1109/FPT.2014.7082757","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082757","url":null,"abstract":"","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"39 1","pages":"76-82"},"PeriodicalIF":0.0,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83458446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718320
P. Chow
Summary form only given. Ever since FPGAs were invented, there has been great interest in using them as computing devices, and with the logic densities of today's devices, many interesting functions have been shown to have significant performance and energy benefits when implemented in FPGAs. However, when an application requires the combination of a high-performance CPU and an FPGA accelerator, the effectiveness of the FPGA is largely determined by the latency and bandwidth between the CPU, the CPU memory system, and the FPGA and its memory system. Putting FPGAs into the CPU socket is one way to address this issue. This talk will present the history, the advantages and disadvantages, the challenges, architectures, programming models, and applications of "in-socket" accelerator systems.
{"title":"Why Put FPGAs in your CPU socket?","authors":"P. Chow","doi":"10.1109/FPT.2013.6718320","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718320","url":null,"abstract":"Summary form only given. Ever since FPGAs were invented, there has been great interest in using them as computing devices, and with the logic densities of today's devices, many interesting functions have been shown to have significant performance and energy benefits when implemented in FPGAs. However, when an application requires the combination of a high-performance CPU and an FPGA accelerator, the effectiveness of the FPGA is highly determined by the latency and bandwidth between the CPU, the CPU memory system and the FPGA and its memory system. Putting FPGAs into the CPU socket is one way to address this issue. This talk will present the history, the advantages and disadvantages, the challenges, architectures, programming models and applications of \"insocket\" accelerator systems.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"17 1","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81476395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}