Parametrizing multicore architectures for multiple sequence alignment

S. Isaza, Friman Sánchez, F. Cabarcas, Alex Ramírez, G. Gaydadjiev
{"title":"Parametrizing multicore architectures for multiple sequence alignment","authors":"S. Isaza, Friman Sánchez, F. Cabarcas, Alex Ramírez, G. Gaydadjiev","doi":"10.1145/2016604.2016642","DOIUrl":null,"url":null,"abstract":"Sequence alignment is one of the fundamental tasks in bioinformatics. Due to the exponential growth of biological data and the computational complexity of the algorithms used, high performance computing systems are required. Although multicore architectures have the potential of exploiting the task-level parallelism found in these workloads, efficiently harnessing systems with hundreds of cores requires deep understanding of the applications and the architecture. When incorporating large numbers of cores, performance scalability will likely saturate shared hardware resources like buses and memories. In this paper we evaluate the performance impact of various configurations of an accelerator-based multicore architecture with the aim of revealing and quantifying the bottlenecks. Then, we compare against a multicore using general-purpose processors and discuss the performance gap. Our target application is ClustalW, one of the most popular programs for Multiple Sequence Alignment. Different input data sets are characterized and we show how they influence performance. Simulation results show that due to the high computation-to-communication ratio and the transfer of data in large chunks, memory latency is well tolerated. However, bandwidth is critical to achieving maximum performance. Using a 32KB cache configuration with 4 banks can capture most of the memory traffic and therefore avoid expensive off-chip transactions. On the other hand, using a hardware queue for the tasks synchronization allows us to handle a large number of cores. Finally, we show that using a simple load balancing strategy, we can increase performance of general-purpose cores by 28%.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM International Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2016604.2016642","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Sequence alignment is one of the fundamental tasks in bioinformatics. Due to the exponential growth of biological data and the computational complexity of the algorithms used, high performance computing systems are required. Although multicore architectures have the potential of exploiting the task-level parallelism found in these workloads, efficiently harnessing systems with hundreds of cores requires deep understanding of the applications and the architecture. When incorporating large numbers of cores, performance scalability will likely saturate shared hardware resources like buses and memories. In this paper we evaluate the performance impact of various configurations of an accelerator-based multicore architecture with the aim of revealing and quantifying the bottlenecks. Then, we compare against a multicore using general-purpose processors and discuss the performance gap. Our target application is ClustalW, one of the most popular programs for Multiple Sequence Alignment. Different input data sets are characterized and we show how they influence performance. Simulation results show that due to the high computation-to-communication ratio and the transfer of data in large chunks, memory latency is well tolerated. However, bandwidth is critical to achieving maximum performance. Using a 32KB cache configuration with 4 banks can capture most of the memory traffic and therefore avoid expensive off-chip transactions. On the other hand, using a hardware queue for the tasks synchronization allows us to handle a large number of cores. Finally, we show that using a simple load balancing strategy, we can increase performance of general-purpose cores by 28%.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
多序列比对的多核结构参数化
序列比对是生物信息学的基本任务之一。由于生物数据的指数增长和所用算法的计算复杂性,需要高性能的计算系统。尽管多核体系结构有潜力利用这些工作负载中的任务级并行性,但要有效利用具有数百个核心的系统,需要对应用程序和体系结构有深入的了解。当合并大量核心时,性能可伸缩性可能会使总线和内存等共享硬件资源饱和。在本文中,我们评估了基于加速器的多核架构的各种配置对性能的影响,目的是揭示和量化瓶颈。然后,我们比较了使用通用处理器的多核,并讨论了性能差距。我们的目标应用程序是ClustalW,它是最流行的多序列比对程序之一。我们描述了不同的输入数据集,并展示了它们如何影响性能。仿真结果表明,由于高计算通信比和大数据块传输,内存延迟可以很好地耐受。然而,带宽是实现最大性能的关键。使用带有4个bank的32KB缓存配置可以捕获大部分内存流量,从而避免昂贵的片外事务。另一方面,使用硬件队列进行任务同步使我们能够处理大量的内核。最后,我们展示了使用简单的负载平衡策略,我们可以将通用核心的性能提高28%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Strategies for improving performance and energy efficiency on a many-core Cost-effective soft-error protection for SRAM-based structures in GPGPUs Kinship: efficient resource management for performance and functionally asymmetric platforms An algorithm for parallel calculation of trigonometric functions DCNSim: a unified and cross-layer computer architecture simulation framework for data center network research
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1