Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays: latest articles


Session details: Tools and models 2


Pub Date : 2014-02-26 DOI: 10.1145/3260943

K. Rupnow

Citations: 0

Pipelining FPGA-based defect detection in FPDs (abstract only)


Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554729

Lin Meng, K. Matsuyama, Naoto Nojiri, T. Izumi, K. Yamazaki

Real-time detection of defects in Flat-Panel Displays (FPDs) is very important during production. This paper describes how defects induced by bubbles are detected as fast as possible by using 4-stage image processing pipelines with 3-line buffers on a Field-Programmable Gate Array (FPGA). The image processing consists of reading a Time Delay Integration (TDI) image, Laplacian filtering, binarization, and labeling. TDI is applied to the initial image of the FPD to reduce the noise introduced when the FPD images are captured. Laplacian filtering and binarization are used to detect the edges in the image, and labeling is used to number the objects in the image for defect detection. In the 4-stage pipeline, the first stage reads the TDI image from the Block Random Access Memory (BRAM), the second stage implements Laplacian filtering and binarization, the third stage implements labeling, and the final stage revises the labels and writes them into the BRAM. Laplacian filtering requires the target pixel and its eight surrounding neighbors, and labeling requires four neighbors. Thus, three line registers (a 3-line buffer) are used as a general pipeline register between each pair of neighboring stages in our system. The pipelined system accesses these 3-line buffers and runs the four image processing steps in parallel. Therefore, the system uses four different addresses to access the BRAM and the 3-line buffers. Further, to facilitate performance comparison, we implemented sequential image processing systems with 3-line buffers on an FPGA and in CPU software. The experiments reveal that Laplacian filtering, binarization, and labeling for FPD defect detection can be executed in less than 1 ms by using four-stage pipelining on an FPGA, which is 3.62 times faster than the sequential system and 158.7 times faster than the CPU software. The pipelined system is 28% larger than the sequential system in terms of LUT usage.
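The three processing steps named in the abstract can be illustrated in software. The following is a minimal sequential sketch in Python, not the authors' FPGA design: the function names, the threshold value, the use of the absolute filter response, and the choice of the four already-scanned neighbors (W, NW, N, NE) for labeling are all assumptions made for illustration. The second labeling pass stands in for the "revise the labels" stage.

```python
def laplacian(img):
    """8-neighbor Laplacian (8*center - sum of neighbors); border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sum(img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = 9 * img[y][x] - window  # window includes the center once
    return out


def binarize(img, threshold):
    """Mark pixels whose absolute filter response reaches the threshold (assumed rule)."""
    return [[1 if abs(v) >= threshold else 0 for v in row] for row in img]


def label(binary):
    """Two-pass labeling using the four already-scanned neighbors (W, NW, N, NE)."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    parent = [0]  # parent[i] is the union-find parent of provisional label i

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # First pass: assign provisional labels and record equivalences.
    for y in range(h):
        for x in range(w):
            if not binary[y][x]:
                continue
            roots = []
            for dy, dx in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                ny, nx = y + dy, x + dx
                if ny >= 0 and 0 <= nx < w and labels[ny][nx]:
                    roots.append(find(labels[ny][nx]))
            if not roots:
                parent.append(len(parent))
                labels[y][x] = len(parent) - 1
            else:
                root = min(roots)
                labels[y][x] = root
                for r in roots:
                    parent[r] = root

    # Second pass: revise provisional labels to consecutive final labels.
    remap = {}
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)


def detect(img, threshold):
    """Run filtering, binarization, and labeling in sequence."""
    return label(binarize(laplacian(img), threshold))
```

Calling `detect(image, threshold)` on a list-of-lists grayscale image returns the label map and the number of labeled objects. A hardware pipeline would instead stream pixels through line buffers so that the stages overlap in time; this sketch only shows the per-stage arithmetic.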


Citations: 0

Scalable multi-access flash store for big data analytics


Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554789

S. Jun, Ming Liu, Kermin Fleming, Arvind

For many "Big Data" applications, the limiting factor in performance is often the transportation of large amounts of data from hard disks to where it can be processed, i.e. DRAM. In this paper we examine an architecture for a scalable distributed flash store which aims to overcome this limitation in two ways. First, the architecture provides high-performance, high-capacity, scalable random-access storage. It achieves high throughput by sharing large numbers of flash chips across a low-latency, chip-to-chip backplane network managed by the flash controllers. The additional latency for remote data access via this network is negligible compared to flash access time. Second, it permits some computation near the data via an FPGA-based programmable flash controller. The controller is located in the datapath between the storage and the host, and provides hardware acceleration for applications without any additional latency. We have constructed a small-scale prototype whose network bandwidth scales directly with the number of nodes, and where average latency for user software to access the flash store is less than 70 µs, including 3.5 µs of network overhead.


Citations: 37

Rent's rule based FPGA packing for routability optimization


Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554763

Wenyi Feng, J. Greene, Kristofer Vorwerk, V. Pevzner, A. Kundu

Packing is a critical step in the CAD flow for cluster-based FPGA architectures, and has a significant impact on the quality of the final placement and routing results. One basic quality metric is routability. Traditionally, minimizing cut (the number of external signals) has been used as the main criterion in packing for routability optimization. This paper shows that minimizing cut is a sub-optimal criterion, and argues for using the Rent characteristic as the new criterion for FPGA packing. We further propose using a recursive bipartitioning-based k-way partitioner to optimize the Rent characteristic during packing. We developed a new packer, PPack2, based on this approach. Compared to T-VPack, PPack2 achieves 35.4%, 35.6%, and 11.2% reductions in wire length, minimum channel width, and critical path delay, respectively. These improvements show that PPack2 outperforms all previous leading packing tools (including iRAC, HDPack, and the original PPack) by a wide margin.
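For readers unfamiliar with the Rent characteristic, Rent's rule is the empirical power law T = t * B^p relating the number of external terminals T of a partition to the number of blocks B it contains. The sketch below estimates t and p with a least-squares fit in log-log space. It is a generic illustration of the rule, not PPack2's procedure; the function name `fit_rent` and the sample format are assumptions.

```python
import math


def fit_rent(samples):
    """Fit T = t * B**p to (block_count, terminal_count) pairs.

    Performs ordinary least squares on log(T) versus log(B) and
    returns the pair (t, p). Requires at least two distinct block counts.
    """
    xs = [math.log(blocks) for blocks, _ in samples]
    ys = [math.log(terminals) for _, terminals in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx                      # slope is the Rent exponent
    t = math.exp(mean_y - p * mean_x)  # intercept gives terminals per block
    return t, p
```

A lower fitted exponent p means the number of external signals grows more slowly as partitions get larger, which is the property a Rent-driven packer tries to improve.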


Citations: 14

A methodology for identifying and placing heterogeneous cluster groups based on placement proximity data (abstract only)


Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554726

Farnaz Gharibian, Lesley Shannon, P. Jamieson

Due to the rapid growth in the size of designs and Field Programmable Gate Arrays (FPGAs), CAD run-time has increased dramatically. Reducing FPGA design compilation times without degrading circuit performance is crucial. In this work, we describe a novel approach for incremental design flows that both identifies tightly grouped FPGA logic blocks and then uses this information during circuit placement. Our approach reduces placement run-time on average by more than 17% while typically maintaining the design's critical path delay and marginally increasing its minimum channel width and wire length on average. Instead of following the traditional approach of evaluating a circuit's pre-placement netlist, this new algorithm analyzes designs post-placement to detect proximity data. It uses this information to non-aggressively extract heterogeneous cluster groupings from the design, which we call "gems," that consist of two to seventeen clusters. We modified VPR's simulated annealing placement algorithm to use our Singularity Placer, which first crushes each cluster grouping into a "singularity," to be treated as a single cluster. We then run the annealer over this condensed circuit, followed by an expansion of the singularities, and a second annealing phase for the entire expanded circuit.
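The "crush into a singularity" step can be pictured as a netlist condensation: every cluster in a group is replaced by a single stand-in node before annealing. The sketch below is a hypothetical illustration of that idea only. The data layout (nets as sets of cluster names, groups as a dictionary) and the function name `condense` are assumptions, and the abstract does not say how the authors' tool represents or expands the groups.

```python
def condense(nets, groups):
    """Collapse each cluster group into one stand-in node.

    nets:   list of sets of cluster names connected by each net
    groups: dict mapping a group name to the set of clusters it contains
    Returns the condensed nets; nets that fall entirely inside one
    group are dropped because they no longer connect distinct nodes.
    """
    owner = {cluster: name for name, members in groups.items() for cluster in members}
    condensed = []
    for net in nets:
        merged = {owner.get(cluster, cluster) for cluster in net}
        if len(merged) > 1:
            condensed.append(merged)
    return condensed
```

An annealer run on the condensed netlist has fewer movable objects, which is one plausible source of the run-time reduction the abstract reports.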


Citations: 0
