2014 International Conference on Field-Programmable Technology (FPT): Latest Papers

A dataflow system for anomaly detection and analysis
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082793 Pages: 276-279
A. Bara, Xinyu Niu, W. Luk
This paper proposes DeADA, a dataflow architecture incorporating an automated, unsupervised, online learning algorithm. Compared with 24-core software implementations, DeADA achieves up to 6.17 times lower data drop rate and 10.7 times higher power efficiency. More importantly, experimental results for the Heartbleed case study suggest that DeADA can detect unknown attacks at network speeds of at least 18 Mbps, a capability that is essential for modern network intrusion detection.
Citations: 4
FPGA-accelerated Monte-Carlo integration using stratified sampling and Brownian bridges
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082755 Pages: 68-75
M. D. Jong, V. Sima, K. Bertels, David B. Thomas
Monte-Carlo Integration (MCI) is a numerical technique for evaluating integrals that have no closed-form solution. Naive MCI samples the integrand at uniformly distributed random points, and converges very slowly. Stratified sampling concentrates the samples on the segments of the integration domain where the integrand has the highest variance, but even then MCI converges very slowly for multidimensional integrals. In this work, we implement an FPGA-accelerated design for MISER, a widely used adaptive MCI algorithm that applies stratified sampling. We show how to eliminate the recursion from MISER and partition the algorithm between CPUs and FPGAs. The CPUs manage the control-heavy stratification strategy, while the FPGA samples the integrand. The integrand is compiled into a deep pipeline on the FPGA that produces one function evaluation per clock cycle. We demonstrate the design by pricing a path-dependent financial derivative, an Asian option. To make optimal use of the stratification, we implement a Brownian bridge on the FPGA that produces one entire bridge per clock cycle. The FPGA-accelerated design is up to 880 times faster than a software reference using the GSL implementation of MISER. Compared to naive MCI in software, our design requires up to 3572 times less execution time to achieve the same accuracy.
Citations: 2
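The difference between naive and stratified sampling described in the abstract above can be illustrated with a small software sketch. This is not the MISER algorithm (which allocates samples adaptively and recursively) and not the authors' FPGA design: it assumes fixed, equal-width strata, a one-dimensional toy integrand, and hypothetical function names chosen for illustration.

```python
import random

def naive_mci(f, a, b, n, rng):
    """Plain Monte-Carlo estimate of the integral of f over [a, b]."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

def stratified_mci(f, a, b, n, strata, rng):
    """Split [a, b] into equal-width strata and sample each one uniformly.

    Assumption: a fixed, non-adaptive stratification (unlike MISER).
    """
    per_stratum = n // strata
    width = (b - a) / strata
    estimate = 0.0
    for k in range(strata):
        lo = a + k * width
        total = sum(f(rng.uniform(lo, lo + width)) for _ in range(per_stratum))
        estimate += width * total / per_stratum
    return estimate

def integrand(x):
    # Toy integrand; its exact integral over [0, 1] is 1/3.
    return x * x

rng = random.Random(0)  # fixed seed so the run is repeatable
naive = naive_mci(integrand, 0.0, 1.0, 10_000, rng)
stratified = stratified_mci(integrand, 0.0, 1.0, 10_000, 100, rng)
print(f"naive error:      {abs(naive - 1 / 3):.2e}")
print(f"stratified error: {abs(stratified - 1 / 3):.2e}")
```

Both estimators use the same number of integrand evaluations; the stratified one simply spreads them evenly over the domain, which typically reduces the variance of the estimate for a smooth integrand such as this one.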
Using C to implement high-efficient computation of dense optical flow on FPGA-accelerated heterogeneous platforms
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082789 Pages: 260-263
Zhilei Chai, Haojie Zhou, Zhibin Wang, Dong Wu
High-quality algorithms for dense optical flow computation are computationally intensive. Computing them at high speed and low power is vital to making optical flow practical in real-world applications. Whereas only the Horn-Schunck model has been studied on FPGA-based systems to date, this paper implements Combine-Brightness-Gradient, one of the best linear variational methods for dense optical flow computation, on FPGA-accelerated heterogeneous platforms. C is used instead of HDLs, and optimization techniques based on algorithmic parallelism and the hardware architecture are introduced. Experimental results show a 30-110x improvement in computing efficiency over CPUs. The FPGA-accelerated version processes 640 × 480 images at 12 fps with 0.38 J per frame, compared with 0.8 fps and around 40 J on CPUs. By demonstrating a high-performance, low-power dense optical flow algorithm implemented in C on FPGA-based heterogeneous platforms, this paper shows that off-the-shelf commodity FPGAs coupled with High-Level Synthesis (HLS) tools are a viable option when both computational efficiency and development speed are required.
Citations: 6
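The frame-rate and energy figures quoted in the abstract above can be combined with a little arithmetic. The sketch below assumes that the CPU figure of "around 40 J" is also an energy per frame, and that average power is simply frames per second times energy per frame; whether the authors define "computing efficiency" as the energy-per-frame ratio is not stated in the abstract.

```python
# Figures quoted in the abstract (the CPU energy is approximate).
fpga_fps, fpga_joules_per_frame = 12.0, 0.38
cpu_fps, cpu_joules_per_frame = 0.8, 40.0  # assumption: 40 J is per frame

def average_power_watts(fps, joules_per_frame):
    """Average power implied by a frame rate and an energy per frame."""
    return fps * joules_per_frame

fpga_power = average_power_watts(fpga_fps, fpga_joules_per_frame)
cpu_power = average_power_watts(cpu_fps, cpu_joules_per_frame)
speedup = fpga_fps / cpu_fps
energy_ratio = cpu_joules_per_frame / fpga_joules_per_frame

print(f"FPGA average power: {fpga_power:.2f} W")
print(f"CPU average power:  {cpu_power:.2f} W")
print(f"Throughput speedup: {speedup:.1f}x")
print(f"Energy per frame:   {energy_ratio:.1f}x lower on the FPGA")
```

Under these assumptions the throughput gain is 15x while the energy-per-frame ratio is roughly 105x, which sits near the upper end of the 30-110x range reported in the abstract.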
Network recorder and player: FPGA-based network traffic capture and replay
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082815 Pages: 342-345
Siyi Qiao, Chen Xu, Lei Xie, Ji Yang, Chengchen Hu, X. Guan, Jianhua Zou
An appropriate tool for generating real network traffic plays an important role in testing network systems. Traditionally, such tools rely on software solutions that copy data back and forth between different parts of memory to capture or replay network traffic. In this paper, we propose an FPGA-centric approach using parallel logic, which ensures high timing accuracy and high throughput. We first design an FPGA add-on board that handles miscellaneous tasks such as adding content or calculating statistics. The system is implemented on a custom-designed network add-on card based on an off-the-shelf FPGA to demonstrate the viability of our approach. Experiments demonstrate a reasonable performance improvement (higher throughput and replay time precision) compared with software-based solutions.
Citations: 7
An efficient FPGA implementation of QR decomposition using a novel systolic array architecture based on enhanced vectoring CORDIC
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082764 Pages: 123-130
Jianfeng Zhang, P. Chow, Hengzhu Liu
Multiple input multiple output (MIMO) - Orthogonal frequency division multiplexing (OFDM) systems typically use Orthogonal-triangular (QR) decomposition. In this paper, we present a novel systolic array architecture to realize QR decomposition based on the Givens rotation method for a 4 × 4 real matrix. The coordinate rotation digital computer (CORDIC) algorithm is adopted and modified to speed up and simplify the Givens rotation. To verify the function and evaluate the performance, the proposed architectures are validated on a Virtex 5 FPGA development platform. Compared to a commercial implementation of vectoring CORDIC, an enhanced vectoring CORDIC is presented that uses 37.7% less hardware resources, dissipates 76.8% less power and provides a 1.8 times speed-up while maintaining the same computation accuracy. The novel QR systolic array architecture based on the enhanced vectoring CORDIC saves 5% in hardware and the throughput is improved by a factor of two with no accuracy penalty when compared with the best previous version of the QR systolic array.
Citations: 3
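To make the combination of Givens rotations and CORDIC in the abstract above more concrete, here is a minimal software sketch of the textbook technique. It is not the paper's enhanced vectoring CORDIC or its systolic array: it assumes floating-point arithmetic instead of fixed-point hardware, a sequential loop instead of parallel cells, and hypothetical function names. Vectoring mode rotates a pair (x, y) onto the x-axis and records the micro-rotation directions; the same directions are then replayed on the remaining column pairs, which mirrors how a boundary cell can drive internal cells.

```python
import math

ITERATIONS = 32
# Accumulated scaling factor of the micro-rotations (about 1.6468).
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(ITERATIONS))

def cordic_vector(x, y):
    """Vectoring mode: rotate (x, y) onto the positive x-axis.

    Returns the vector norm and the rotation (pre-flip flag plus directions).
    """
    flip = x < 0  # pre-rotate by pi so the iterations converge
    if flip:
        x, y = -x, -y
    directions = []
    for i in range(ITERATIONS):
        d = 1 if y < 0 else -1  # always drive y toward zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        directions.append(d)
    return x / GAIN, (flip, directions)

def cordic_rotate(x, y, rotation):
    """Rotation mode: replay a recorded rotation on another pair."""
    flip, directions = rotation
    if flip:
        x, y = -x, -y
    for i, d in enumerate(directions):
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x / GAIN, y / GAIN

def givens_qr_r(a):
    """Upper-triangular factor R of a square matrix via CORDIC Givens rotations."""
    r = [row[:] for row in a]
    n = len(r)
    for col in range(n - 1):
        for row in range(col + 1, n):
            norm, rotation = cordic_vector(r[col][col], r[row][col])
            r[col][col], r[row][col] = norm, 0.0  # element is annihilated
            for c in range(col + 1, n):
                r[col][c], r[row][c] = cordic_rotate(r[col][c], r[row][c], rotation)
    return r

# Illustrative 4 x 4 real matrix (values chosen arbitrarily).
A = [[4.0, 1.0, -2.0, 2.0],
     [1.0, 2.0, 0.0, 1.0],
     [-2.0, 0.0, 3.0, -2.0],
     [2.0, 1.0, -2.0, -1.0]]
R = givens_qr_r(A)
for row in R:
    print(" ".join(f"{value:9.5f}" for value in row))
```

Because every rotation is orthogonal, the product of R transposed with R should match the product of A transposed with A up to rounding, which gives a simple way to check the result without forming Q explicitly.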
Zero latency encryption with FPGAs for secure time-triggered automotive networks
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082788 Pages: 256-259
Shanker Shreejith, Suhaib A. Fahmy
Security has emerged as a key concern in increasingly complex embedded automotive networks. The distributed architecture and broadcast transmission characteristics mean they are vulnerable and provide little resistance to intrusive and non-intrusive attack mechanisms. Incorporating data security using traditional approaches introduces significant latency which can be problematic in the presence of real-time deadlines. We demonstrate how a security layer can be added within the network communication controller in modern time-triggered systems, without introducing additional latency or processing overheads. This allows critical communications to be secured in a manner that is transparent to the processors in the electronic control units (ECUs), while also safeguarding network communication properties.
Citations: 10
Collaborative processing of Least-Square Monte Carlo for American options
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082753 Pages: 52-59
Jinzhe Yang, Ce Guo, W. Luk, Terence Nahar
American options are widely traded in the financial market, so pricing them is crucial in practice. Many popular pricing models have no analytical solutions, so techniques such as Monte Carlo are often used. This paper presents a CPU-FPGA collaborative accelerator that uses the state-of-the-art Least-Square Monte Carlo method for pricing American options. We provide a new sequence for generating the Monte Carlo paths, and a precalculation strategy for the regression process. Our design is customisable for different pricing models, discretisation schemes, and regression functions. The Heston model is used as a case study for evaluating our strategy. Experimental results show that an FPGA-based solution can be 22 to 64.5 times faster than a single-core CPU implementation.
Citations: 0
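The Least-Square Monte Carlo method named in the abstract above can be sketched in plain software. This is a minimal illustration under stated assumptions, not the authors' accelerator: it uses geometric Brownian motion rather than the Heston model, a quadratic polynomial basis for the regression, a conventional forward path generation order (not the paper's new sequence or precalculation strategy), and illustrative parameter values and function names.

```python
import math
import random

def solve_3x3(m, v):
    """Solve the 3 x 3 system m x = v by Gaussian elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))) / a[r][r]
    return x

def lsm_american_put(s0, strike, rate, sigma, maturity, steps, paths, seed=0):
    """Least-squares Monte Carlo price of an American put (GBM assumption)."""
    rng = random.Random(seed)
    dt = maturity / steps
    drift = (rate - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    disc = math.exp(-rate * dt)
    # Simulate every path up front: prices[p][t] for t = 0..steps.
    prices = []
    for _ in range(paths):
        path = [s0]
        for _step in range(steps):
            path.append(path[-1] * math.exp(drift + vol * rng.gauss(0.0, 1.0)))
        prices.append(path)
    # Cash flow of each path, always valued at the current time step.
    cash = [max(strike - path[steps], 0.0) for path in prices]
    for t in range(steps - 1, 0, -1):
        cash = [c * disc for c in cash]
        itm = [p for p in range(paths) if prices[p][t] < strike]
        if len(itm) < 3:
            continue
        # Normal equations for the basis (1, x, x^2) with x = S / K.
        m = [[0.0] * 3 for _ in range(3)]
        v = [0.0] * 3
        for p in itm:
            x = prices[p][t] / strike
            basis = (1.0, x, x * x)
            for i in range(3):
                v[i] += basis[i] * cash[p]
                for j in range(3):
                    m[i][j] += basis[i] * basis[j]
        beta = solve_3x3(m, v)
        for p in itm:
            x = prices[p][t] / strike
            continuation = beta[0] + beta[1] * x + beta[2] * x * x
            exercise = strike - prices[p][t]
            if exercise > continuation:
                cash[p] = exercise  # exercise now replaces the later cash flow
    return disc * sum(cash) / paths

price = lsm_american_put(36.0, 40.0, 0.06, 0.2, 1.0, steps=50, paths=5000)
print(f"Estimated American put price: {price:.3f}")
```

The backward loop shows why the method is a natural fit for a CPU-FPGA split: path generation is regular and highly parallel, while the regression at each exercise date depends on all in-the-money paths at once.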
Size aware placement for island style FPGAs
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082749 Pages: 28-35
Junying Huang, C. Y. Lin, Yang Liu, Zhihua Li, Haigang Yang
In this paper, we first examine the impact of FPGA size on the overall performance and run-time of placement and routing for cluster-based island-style FPGAs. Based on these observations, we introduce an FPGA placement algorithm, Min-Size, that alleviates the deterioration in performance and in placement and routing run-time when a large FPGA is used to implement a circuit. Min-Size achieves this by generating a more compact placement of logic, I/O and hard blocks. Our experimental results show a 3X and 4X speedup in placement and routing run-time, a 38% and 41% reduction in wire length, and an 8% and 5% improvement in critical path delay when FPGA size increases 10 times.
Citations: 0
Is high level synthesis ready for business? A computational finance case study
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082747 Pages: 12-19
G. Inggs, Shane T. Fleming, David B. Thomas, W. Luk
High Level Synthesis (HLS) tools for Field Programmable Gate Arrays (FPGAs) have made considerable progress, and are now sufficiently mature that a novice developer can create a functionally correct implementation with limited understanding of the target hardware. In this case study, a novice developer considers a benchmark of financial problems for implementation on FPGAs via HLS. The novice starts by extending an existing implementation for a CPU or GPU using tools such as Xilinx's Vivado HLS, the Altera OpenCL SDK or Maxeler's MaxCompiler. When the direct source code translation inevitably fails to meet performance expectations, the developer applies optimisations such as exploiting task or pipeline parallelism as well as C-slowing. When a combination of these optimisations is considered across a range of devices and process technologies, these tools achieve an acceleration of up to 220 times, the sort of acceleration expected of custom architectures. Compared to the 31 times improvement shown by an optimised multicore CPU implementation, the 60 times improvement by a GPU and 207 times by a Xeon Phi, these results suggest that HLS is indeed ready for industrial adoption.
Citations: 27
Hardware/software co-design architecture for Blokus Duo solver
Pub Date: 2014-12-01 DOI: 10.1109/FPT.2014.7082820 Pages: 358-361
N. Sugimoto, H. Amano
This paper presents the software and hardware design of an FPGA-based Blokus Duo solver. We use an embedded system, the Zynq-7000 All Programmable SoC, to implement the solver. Combining hardware with software yields efficient acceleration. Our system searches the game tree using the minimax algorithm with alpha-beta pruning. The implemented solver runs at 75 MHz on a Xilinx Zynq-7000 AP SoC XC7Z020-CLG484 on the Digilent ZedBoard. In most cases it can search the states reached after three moves.
Citations: 6
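The minimax search with alpha-beta pruning mentioned in the abstract above can be sketched on a toy game tree. This is a generic illustration, not the authors' Blokus Duo solver: the tree, the leaf scores, and the names `TREE`, `SCORES`, `children` and `evaluate` are hypothetical, and a real solver would generate moves and evaluate board states instead.

```python
import math

def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Minimax value of node, skipping branches that cannot change the result."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if maximizing:
        best = -math.inf
        for child in kids:
            best = max(best, alphabeta(child, depth - 1, alpha, beta,
                                       False, children, evaluate))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizing parent will never allow this branch
        return best
    best = math.inf
    for child in kids:
        best = min(best, alphabeta(child, depth - 1, alpha, beta,
                                   True, children, evaluate))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizing parent already has a better option
    return best

# Hypothetical two-ply game tree with nine leaves.
TREE = {"root": ["a", "b", "c"],
        "a": ["a1", "a2", "a3"],
        "b": ["b1", "b2", "b3"],
        "c": ["c1", "c2", "c3"]}
SCORES = {"a1": 3, "a2": 12, "a3": 8,
          "b1": 2, "b2": 4, "b3": 6,
          "c1": 14, "c2": 5, "c3": 2}
evaluated = []  # records which leaves were actually scored

def children(node):
    return TREE.get(node, [])

def evaluate(node):
    evaluated.append(node)
    return SCORES[node]

value = alphabeta("root", 2, -math.inf, math.inf, True, children, evaluate)
print("minimax value:", value)
print("leaves evaluated:", len(evaluated), "of", len(SCORES))
```

In this toy tree the search returns 3 after scoring 7 of the 9 leaves: once branch `b` yields a 2, it can no longer beat the 3 already guaranteed by branch `a`, so its remaining leaves are skipped.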