Parametrizing multicore architectures for multiple sequence alignment

ACM International Conference on Computing Frontiers | Pub Date: 2011-05-03 | DOI: 10.1145/2016604.2016642

S. Isaza, Friman Sánchez, F. Cabarcas, Alex Ramírez, G. Gaydadjiev


Citations: 3

Abstract

Sequence alignment is one of the fundamental tasks in bioinformatics. Due to the exponential growth of biological data and the computational complexity of the algorithms used, high-performance computing systems are required. Although multicore architectures can potentially exploit the task-level parallelism found in these workloads, efficiently harnessing systems with hundreds of cores requires a deep understanding of both the applications and the architecture. As the number of cores grows, shared hardware resources such as buses and memories are likely to saturate and limit performance scalability. In this paper we evaluate the performance impact of various configurations of an accelerator-based multicore architecture, with the aim of revealing and quantifying the bottlenecks. We then compare against a multicore built from general-purpose processors and discuss the performance gap. Our target application is ClustalW, one of the most popular programs for Multiple Sequence Alignment. We characterize different input data sets and show how they influence performance. Simulation results show that, due to the high computation-to-communication ratio and the transfer of data in large chunks, memory latency is well tolerated. Bandwidth, however, is critical to achieving maximum performance. A 32 KB cache configuration with 4 banks can capture most of the memory traffic and therefore avoid expensive off-chip transactions. In addition, using a hardware queue for task synchronization allows us to handle a large number of cores. Finally, we show that a simple load balancing strategy can increase the performance of general-purpose cores by 28%.
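The abstract does not say which load balancing strategy was used, so the following is only an illustrative sketch of one common approach, not the authors' method. It assumes that the parallel tasks are the all-pairs pairwise alignments of ClustalW's first stage, that the cost of aligning two sequences is roughly proportional to the product of their lengths (as in dynamic-programming alignment), and that tasks are handed to the least-loaded core, optionally longest first. The function names, the cost model, and the sequence lengths are all hypothetical.

```python
import heapq
from itertools import combinations


def pairwise_tasks(lengths):
    """Build one task per unordered sequence pair.

    Assumed cost model: cost = len_i * len_j (dynamic-programming alignment).
    """
    return [(lengths[i] * lengths[j], i, j)
            for i, j in combinations(range(len(lengths)), 2)]


def schedule(tasks, n_cores, longest_first):
    """Greedy list scheduling: give each task to the least-loaded core.

    Returns the makespan, i.e. the total estimated cost on the busiest core.
    """
    if longest_first:
        tasks = sorted(tasks, reverse=True)  # most expensive tasks first
    loads = [0] * n_cores
    heapq.heapify(loads)
    for cost, _, _ in tasks:
        least = heapq.heappop(loads)         # currently least-loaded core
        heapq.heappush(loads, least + cost)
    return max(loads)


# Hypothetical sequence lengths with a wide spread, which makes imbalance likely.
lengths = [120, 450, 80, 900, 300, 60, 700, 150]
n_cores = 4
tasks = pairwise_tasks(lengths)
total = sum(cost for cost, _, _ in tasks)
max_cost = max(cost for cost, _, _ in tasks)

in_order_makespan = schedule(tasks, n_cores, longest_first=False)
sorted_makespan = schedule(tasks, n_cores, longest_first=True)

print("ideal (perfect balance):", total / n_cores)
print("in input order:         ", in_order_makespan)
print("longest first:          ", sorted_makespan)
```

The gap between each makespan and the ideal value `total / n_cores` is a rough measure of load imbalance under these assumptions. Real alignment times also depend on the sequence content and on memory behavior, which this sketch ignores.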


