
Latest publications from the 2011 IEEE 9th Symposium on Application Specific Processors (SASP)

A novel parallel Tier-1 coder for JPEG2000 using GPUs
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941091
Roto Le, R. I. Bahar, J. Mundy
The JPEG2000 image compression standard provides superior features to the popular JPEG standard; however, the slow performance of software implementations of JPEG2000 has kept it from being widely adopted. More than 80% of the execution time for JPEG2000 is spent in the Tier-1 coding engine. While much effort over the past decade has been devoted to optimizing this component, its performance remains slow. The major reason is that the Tier-1 coder consists of highly serial operations, each operating on individual bits in every single bit plane of the image samples. In addition, until recently there was no efficient hardware platform that could provide massively parallel acceleration for Tier-1. The recent growth of general-purpose graphics processing units (GPGPUs), however, provides a great opportunity to attack the problem with thousands of parallel processing threads. In this paper, the computation steps in JPEG2000, particularly in Tier-1, are examined, and novel, GPGPU-compatible parallel processing methods for the sample-level coding of images are developed. The GPGPU-based parallel engine achieves significant speedup in execution time over the JasPer JPEG2000 compression software. Running on a single Nvidia GTX 480 GPU, the parallel wavelet engine achieves 100× speedup, the parallel bit-plane coder achieves more than 30× speedup, and the overall Tier-1 coder achieves up to 17× speedup.
Citations: 27
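As a software illustration of the data the Tier-1 engine iterates over, the toy Python sketch below (an illustration only, not the authors' CUDA code) splits quantized coefficient magnitudes into the bit planes that the bit-plane coder scans serially, bit by bit:

```python
def bit_planes(coeffs, num_planes):
    """Split coefficient magnitudes into bit planes, most significant
    first.  The Tier-1 coder visits each plane one bit at a time, which
    is the highly serial behavior the paper parallelizes on the GPU."""
    mags = [abs(c) for c in coeffs]
    return [[(m >> p) & 1 for m in mags]
            for p in range(num_planes - 1, -1, -1)]

# Four 4-bit samples produce four planes of four bits each.
planes = bit_planes([5, -3, 12, 0], num_planes=4)
print(planes[0])  # MSB plane: only 12 (0b1100) has bit 3 set -> [0, 0, 1, 0]
```

In JPEG2000 the sign is coded separately from the magnitude, which is why the sketch drops it with `abs()`.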
A hardware acceleration technique for gradient descent and conjugate gradient
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941086
David Kesler, Biplab Deka, Rakesh Kumar
Application Robustification, a promising approach for reducing processor power, converts applications into numerical optimization problems and solves them using gradient descent and conjugate gradient algorithms [1]. The improvement in robustness, however, comes at the expense of performance compared to the baseline non-iterative versions of these applications. To mitigate the performance loss from robustification, we present the design of a hardware accelerator, and corresponding software support, that accelerates gradient-descent- and conjugate-gradient-based iterative implementations of applications. Unlike traditional accelerators, our design accelerates the different types of linear algebra operations found in many algorithms and is capable of efficiently handling the sparse matrices that arise in applications such as graph matching. We show that the proposed accelerator can provide significant speedups for iterative versions of several applications and that for some applications, such as least squares, it can substantially improve computation time compared to the baseline non-iterative implementation.
Citations: 9
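The conjugate gradient kernel that the accelerator speeds up can be sketched in plain Python (a textbook dense-matrix version for illustration only; the paper's design also targets gradient descent and sparse matrices):

```python
def conjugate_gradient(A, b, iters=50, tol=1e-10):
    # Solve A x = b for a symmetric positive-definite A, the iterative
    # kernel that robustified applications are built around.
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual b - A x, with x = 0
    p = r[:]                      # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:          # residual small enough: converged
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

# A is SPD and the exact solution is [1.0, 1.0].
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [5.0, 4.0])
```

In exact arithmetic CG converges in at most n iterations for an n×n system, which is why this 2×2 example finishes almost immediately; the matrix-vector product inside the loop is the operation the hardware accelerates.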
How sensitive is processor customization to the workload's input datasets?
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941070
Maximilien Breughe, Zheng Li, Yang Chen, Stijn Eyerman, O. Temam, Chengyong Wu, L. Eeckhout
Hardware customization is an effective approach for meeting application performance requirements while achieving high levels of energy efficiency. Application-specific processors achieve high performance at low energy by tailoring their designs towards a specific workload, i.e., an application or application domain of interest. A fundamental question that has so far remained unanswered, though, is to what extent processor customization is sensitive to the training workload's input datasets. Current practice is to consider a single input dataset, or only a few, per workload during the processor design cycle, the reason being that simulation is so prohibitively time-consuming that a large number of datasets cannot be considered. This paper addresses this fundamental question for the first time. In order to perform the large number of runs required to address this question in a reasonable amount of time, we first propose a mechanistic analytical model, built from first principles, that is accurate to within 3.6% on average across a broad design space. The analytical model is at least four orders of magnitude faster than detailed cycle-accurate simulation for design space exploration. Using the model, we are able to study the sensitivity of the optimum customized processor architecture to a workload's input dataset. Considering the MiBench benchmarks with 1,000 datasets per benchmark, we conclude that processor customization is largely dataset-insensitive. This has an important implication in practice: a single dataset, or only a few, is sufficient for determining the optimum processor architecture when designing application-specific processors.
Citations: 11
Modular high-throughput and low-latency sorting units for FPGAs in the Large Hadron Collider
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941075
Amin Farmahini Farahani, A. Gregerson, M. Schulte, Katherine Compton
This paper presents efficient techniques for designing high-throughput, low-latency sorting units for FPGA implementation. Our sorting units use modular design techniques that hierarchically construct large sorting units from smaller building blocks. They are optimized for situations in which only the M largest numbers from N inputs are needed; this situation commonly occurs in high-energy physics experiments and other forms of digital signal processing. Based on these techniques, we design parameterized, pipelined sorting units. A detailed analysis indicates that their resource requirements scale linearly with the number of inputs, latencies scale logarithmically with the number of inputs, and frequencies remain fairly constant. Synthesis results indicate that a single pipelined 256-to-4 sorting unit with 19 stages can perform 200 million sorts per second with a latency of about 95 ns per sort on a Virtex-5 FPGA.
Citations: 16
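The modular construction the paper describes, hierarchically composing an N-to-M sorting unit from smaller blocks and discarding everything but the M largest values at each merge, can be mimicked in software. The Python sketch below is a behavioral model only, not the FPGA design:

```python
def merge_top_m(a, b, m):
    # Merge two descending sorted lists, keeping only the m largest.
    # This truncation is what keeps the hardware small: later stages
    # never see more than m values per input.
    out, i, j = [], 0, 0
    while len(out) < m and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i] >= b[j]):
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out

def sort_unit(values, m):
    # Hierarchically build an N-to-m sorter from single-input base cases,
    # mirroring the paper's construction of large units from small blocks.
    if len(values) == 1:
        return values[:]
    mid = len(values) // 2
    return merge_top_m(sort_unit(values[:mid], m),
                       sort_unit(values[mid:], m), m)

# An 8-to-3 unit returns the three largest inputs in descending order.
print(sort_unit([7, 2, 9, 4, 1, 8, 3, 6], 3))  # -> [9, 8, 7]
```

In the FPGA version each `merge_top_m` level becomes a pipeline stage, which is why latency grows logarithmically with the number of inputs while resources grow linearly.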
TARCAD: A template architecture for reconfigurable accelerator designs
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941071
M. Shafiq, M. Pericàs, N. Navarro, E. Ayguadé
In the race towards computational efficiency, accelerators are achieving prominence. Among the different types, accelerators built on reconfigurable fabric, such as FPGAs, have tremendous potential due to the ability to customize the hardware to the application. However, the lack of a standard design methodology hinders the adoption of such devices and makes portability and reusability across designs difficult. In addition, the generation of highly customized circuits does not integrate nicely with high-level synthesis tools. In this work, we introduce TARCAD, a template architecture for designing reconfigurable accelerators. TARCAD enables high customization in the data management and compute engines while retaining a programming model based on generic programming principles. The template provides generality and scalable performance over a range of FPGAs. We describe the template architecture in detail and show how to implement five important scientific kernels: MxM, Acoustic Wave Equation, FFT, SpMV and Smith-Waterman. TARCAD is compared with other high-level synthesis models and is evaluated against GPUs, a well-known architecture that is far less customizable and, therefore, easier to target from a simple and portable programming model. We analyze the TARCAD template and compare its efficiency on a large Xilinx Virtex-6 device to that of several recent GPU studies.
Citations: 3
Hardware/software co-designed accelerator for vector graphics applications
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941088
Shuo-Hung Chen, Hsiao-Mei Lin, H. Wei, Yi-Cheng Chen, Chih-Tsun Huang, Yeh-Ching Chung
This paper proposes a new hardware accelerator to speed up the performance of vector graphics applications on complex embedded systems. The resulting hardware accelerator is synthesized on a field-programmable gate array (FPGA) and integrated with software components. The paper also introduces a hardware/software co-verification environment which provides in-system at-speed functional verification and performance evaluation to verify the hardware/software integrated architecture. The experimental results demonstrate that the integrated hardware accelerator is fifty times faster than a compiler-optimized software component and it enables vector graphics applications to run nearly two times faster.
Citations: 5
Dynamically reconfigurable architecture for a driver assistant system
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941079
N. Harb, S. Niar, M. Saghir, Y. Elhillali, R. B. Atitallah
Application-specific programmable processors are increasingly being replaced by FPGAs, which offer high levels of logic density, rich sets of embedded hardware blocks, and a high degree of customizability and reconfigurability. New FPGA features such as Dynamic Partial Reconfiguration (DPR) can be leveraged to reduce resource utilization and power consumption while still providing high levels of performance. In this paper, we describe our implementation of a dynamically reconfigurable multiple-target tracking (MTT) module for an automotive driver assistance system. Our module implements a dynamically reconfigurable filtering block that changes with changing driving conditions.
Citations: 14
System integration of Elliptic Curve Cryptography on an OMAP platform
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941077
Sergey Morozov, Christian Tergino, P. Schaumont
Elliptic Curve Cryptography (ECC) is popular for digital signatures and other public-key crypto-applications in embedded contexts. However, ECC is computationally intensive, and in particular the performance of the underlying modular arithmetic remains a concern. We investigate the design space of ECC on TI's OMAP 3530 platform, with a focus on using OMAP's DSP core to accelerate ECC computations for the ARM Cortex-A8 core. We examine the opportunities the heterogeneous platform offers for efficient ECC, including the efficient implementation of the underlying field multiplication on the DSP and the design partitioning that minimizes the communication overhead between the ARM and the DSP. By migrating the computations to the DSP, we demonstrate a significant speedup for the underlying modular arithmetic, with up to a 9.24× reduction in execution time compared to the implementation executing on the ARM Cortex processor. Prototype measurements show an energy reduction of up to 5.3×. We conclude that a heterogeneous platform offers substantial improvements in performance and energy, but we also point out that the cost of inter-processor communication cannot be ignored.
Citations: 11
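The point arithmetic whose underlying field operations the paper offloads to the DSP can be sketched over a small textbook curve: y² = x³ + 2x + 2 over GF(17), with generator (5, 1) of order 19. These are toy parameters for illustration only (real deployments use standardized curves), and Python 3.8+ is assumed for the modular inverse via `pow(x, -1, p)`:

```python
def ec_add(P, Q, a, p):
    # Affine point addition on y^2 = x^3 + a*x + b over GF(p).  Every
    # step is built from the modular multiplications and inversions
    # that dominate ECC cost -- the part the paper moves to the DSP.
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # point at infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def scalar_mult(k, P, a, p):
    # Double-and-add: the outer loop of an ECC signature or key exchange.
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P, a, p)
        P = ec_add(P, P, a, p)
        k >>= 1
    return R

print(ec_add((5, 1), (5, 1), 2, 17))   # doubling -> (6, 3)
print(scalar_mult(19, (5, 1), 2, 17))  # order 19, so -> None (infinity)
```

The partitioning question the paper studies is visible even here: `scalar_mult` is cheap control flow (kept on the ARM side), while the `pow(..., -1, p)` inversions and multiplications inside `ec_add` are the hot field arithmetic worth offloading.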
ARTE: An Application-specific Run-Time management framework for multi-core systems
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941085
Giovanni Mariani, G. Palermo, C. Silvano, V. Zaccaria
Programmable multi-core and many-core platforms exponentially increase the challenge of task mapping and scheduling, provided that enough task parallelism exists for each application. This problem worsens when dealing with small ecosystems such as embedded systems-on-chip. In this case, in fact, the assumption of exploiting a traditional operating system is out of place given the limited memory available to satisfy the run-time footprint of such a configuration.
Citations: 7
A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching
Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941082
Hongjian Li, Bing Ni, M. Wong, K. Leung
The availability of huge amounts of nucleotide sequences catalyzes the development of fast algorithms for approximate DNA and RNA string matching. However, most existing online algorithms can only handle small-scale problems; when querying large genomes, their performance becomes unacceptable. Offline algorithms such as Bowtie and BWA require building indexes, and their memory requirements are high. We have developed a fast CUDA implementation of the agrep algorithm for approximate nucleotide sequence matching by exploiting the huge computational power of modern GPU hardware. Our CUDA program is capable of searching large genomes for patterns of length up to 64 with edit distance up to 9. For example, it is able to search the entire human genome (3.10 Gbp in 24 chromosomes) for patterns of lengths 30 and 60 with edit distances of 3 and 6 within 371 and 1,188 milliseconds respectively on one NVIDIA GeForce GTX285 graphics card, achieving 70-fold and 36-fold speedups over the multithreaded QuadCore CPU counterpart. Our program employs an online approach and does not require building indexes of any kind; it can thus be applied in real time. Using a two-bits-per-character binary representation, its memory requirement is merely one fourth of the original genome size, so it is possible to load multiple genomes simultaneously. The x86 and x64 executables for Linux and Windows, C++ source code, documentation, user manual, and an AJAX MVC website for online real-time searching are available at http://agrep.cse.cuhk.edu.hk. Users can also send emails to CUDAagrepGmail.com to queue up for a job.
Citations: 21
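The agrep family of tools is built on bit-parallel approximate matching (the Wu-Manber extension of the bitap algorithm), which is consistent with the paper's 64-character pattern limit: the whole pattern state fits in one machine word. The CPU-side Python sketch below shows the core recurrence for edit distance up to k; it is an illustrative re-derivation, not the paper's CUDA code, and the function name is an assumption.

```python
def approx_search(text, pattern, k):
    """Bit-parallel (Wu-Manber) search: return end positions in `text`
    where `pattern` matches with edit distance <= k.
    Bit i of R[d] is set iff pattern[0..i] matches some suffix of the
    text read so far with <= d errors. agrep packs this state into a
    64-bit word; Python ints have no such limit."""
    m = len(pattern)
    B = {}                       # per-character match masks
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    # With d errors allowed, the first d pattern chars can be deleted.
    R = [(1 << d) - 1 for d in range(k + 1)]
    hit = 1 << (m - 1)           # full pattern matched
    out = []
    for j, c in enumerate(text):
        mask = B.get(c, 0)
        prev = R[0]              # old R[d-1] for the level above
        R[0] = ((R[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            cur = R[d]
            R[d] = ((((cur << 1) | 1) & mask)   # exact match step
                    | (prev << 1)               # substitution
                    | prev                      # insertion in text
                    | (R[d - 1] << 1)           # deletion from pattern
                    | ((1 << d) - 1))           # leading deletions
            prev = cur
        if R[k] & hit:
            out.append(j)
    return out
```

One text character costs O(k) word operations regardless of pattern length, which is what makes the inner loop amenable to massive GPU parallelism across text chunks. The paper's two-bit nucleotide encoding additionally shrinks the text to a quarter of its ASCII size.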
Journal: 2011 IEEE 9th Symposium on Application Specific Processors (SASP)