
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.20
Jiarui Fang, H. Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, Guangwen Yang
To explore the potential of training complex deep neural networks (DNNs) on commercial chips other than GPUs, we report our work on swDNN, a highly efficient library for accelerating deep learning applications on the newly announced world-leading supercomputer, Sunway TaihuLight. Targeting the SW26010 processor, we derive a performance model that guides us in identifying the most suitable approach for mapping convolutional neural networks (CNNs) onto the 260 cores within the chip. By performing a systematic optimization that explores major factors, such as the organization of convolution loops, blocking techniques, register data communication schemes, and reordering strategies for the two instruction pipelines, we achieve double-precision performance of over 1.6 Tflops for the convolution kernel, 54% of the theoretical peak. Compared with a Tesla K40m running cuDNNv5, swDNN delivers speedups of 1.91-9.75x in an evaluation with over 100 parameter configurations.
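As a quick sanity check on the numbers in the abstract, the Python sketch below recovers the double-precision peak implied by "1.6 Tflops at 54% of peak" and shows how the achieved fraction of peak would be computed for a direct convolution. The layer shape and runtime are hypothetical, not values from the paper.

```python
def conv_flops(n, c, k, h, w, r, s):
    """Flops of a direct convolution (multiply-accumulate counted as 2 flops)."""
    return 2.0 * n * c * k * h * w * r * s

reported_tflops = 1.6
reported_fraction = 0.54
implied_peak = reported_tflops / reported_fraction   # ~2.96 Tflops, double precision

flops = conv_flops(n=128, c=256, k=256, h=14, w=14, r=3, s=3)  # illustrative layer
runtime_s = 0.05                                               # hypothetical time
achieved_tflops = flops / runtime_s / 1e12
print(f"implied DP peak: {implied_peak:.2f} Tflops")
print(f"achieved: {achieved_tflops:.2f} Tflops "
      f"({achieved_tflops / implied_peak:.1%} of peak)")
```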
Citations: 67
MRapid: An Efficient Short Job Optimizer on Hadoop
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.100
Hong Zhang, Hai Huang, Liqiang Wang
Data are being generated and collected at an accelerating pace. Hadoop has made analyzing large-scale data much simpler for developers/analysts using commodity hardware. Interestingly, it has been shown that most Hadoop jobs have small input sizes and do not run for long. For example, higher-level query languages, such as Hive and Pig, handle a complex query by breaking it into smaller ad-hoc ones. Although Hadoop is designed for handling complex queries over large data sets, we found that it is highly inefficient when operating on small-scale data, even though a new Uber mode was introduced specifically to handle jobs with small input sizes. In this paper, we propose an optimized Hadoop extension called MRapid, which significantly speeds up the execution of short jobs. It is completely backward compatible with Hadoop and imposes negligible overhead. Our experiments on the Microsoft Azure public cloud show that MRapid improves performance by up to 88% over the original Hadoop.
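The abstract does not spell out MRapid's decision logic; the sketch below only illustrates the general idea of routing jobs by input size to an optimized short-job path. The threshold and mode names are hypothetical, not MRapid's actual policy.

```python
def choose_execution_mode(input_bytes, short_job_threshold=128 * 1024**2):
    """Route small jobs to an optimized short-job path (threshold is made up)."""
    if input_bytes <= short_job_threshold:
        return "short-job path"      # e.g., reduced startup and scheduling cost
    return "regular MapReduce"

for size in (4 * 1024**2, 64 * 1024**2, 8 * 1024**3):
    print(f"{size / 1024**2:8.0f} MiB -> {choose_execution_mode(size)}")
```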
Citations: 23
ScalaIOExtrap: Elastic I/O Tracing and Extrapolation
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.45
Xiaoqing Luo, F. Mueller, P. Carns, John Jenkins, R. Latham, R. Ross, S. Snyder
Today's rapid development of supercomputers has caused I/O performance to become a major bottleneck for many scientific applications. Trace analysis tools have thus become vital for diagnosing the root causes of I/O problems. This work contributes an I/O tracing framework with (a) techniques to gather a set of lossless, elastic I/O trace files for a small number of nodes, (b) a mathematical model to analyze trace data and extrapolate it to larger numbers of nodes, and (c) a replay engine for the extrapolated trace file to verify its accuracy. The traces can in principle be extrapolated even beyond the scale of present-day systems and provide a test of whether applications scale in terms of I/O. We conducted our experiments on three platforms: a commodity Linux cluster, an IBM BG/Q system, and a discrete-event simulation of an IBM BG/P system. We investigate a combination of synthetic benchmarks on all platforms as well as a production scientific application on the BG/Q system. The extrapolated I/O trace replays closely resemble the I/O behavior of equivalent applications in all cases.
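The paper's actual extrapolation model is not reproduced here; as a minimal sketch of the idea, the code below fits a trend to per-node-count statistics from small-scale traced runs and extrapolates to larger node counts. The sample points are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

nodes    = [8, 16, 32, 64]                      # node counts actually traced
io_bytes = [2.1e9, 4.0e9, 7.9e9, 15.8e9]        # invented aggregate write volumes

a, b = fit_line(nodes, io_bytes)
for target in (128, 256, 1024):                 # extrapolate beyond traced scale
    print(f"{target:5d} nodes -> predicted {a * target + b:.3e} bytes")
```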
Citations: 10
General Purpose Task-Dependence Management Hardware for Task-Based Dataflow Programming Models
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.48
Xubin Tan, Jaume Bosch, Miquel Vidal Piñol, C. Álvarez, Daniel Jiménez-González, E. Ayguadé, M. Valero
Task-based programming models such as OpenMP, Intel TBB, and OmpSs offer the possibility of expressing dependences among tasks to drive their execution at runtime. Managing these dependences introduces noticeable overheads when targeting fine-grained tasks, diminishing the potential speedups or even introducing performance losses. To overcome this drawback, we present a general-purpose hardware accelerator, Picos++, that manages inter-task dependences efficiently in both time and energy. Our design also includes novel support for nested tasks. To this end, a new hardware/software co-design is presented to overcome the fact that nested tasks with dependences could cause system deadlocks due to the limited resources in hardware task-dependence managers. In this paper we describe a detailed implementation of this design and evaluate a parallel task-based programming model using Picos++ in an embedded Linux system with two ARM Cortex-A9 cores and an FPGA. The scalability and energy consumption of the implemented system have been studied and compared against a software runtime. Even in a system limited to 2 threads, using Picos++ yields more than a 1.8x speedup and 40% energy savings in the most demanding parallelizations of real benchmarks. A hardware task-dependence manager should be able to achieve even higher speedups and greater energy savings with more threads.
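Picos++ is a hardware design, but the bookkeeping it performs can be modeled in software. The sketch below is a minimal software analogue, assuming tasks declare input/output tags (standing in for the memory addresses tracked in hardware): a task becomes ready once the producers of all its inputs have finished. It omits nested-task support and the deadlock-avoidance co-design.

```python
from collections import defaultdict, deque

class DependenceManager:
    """Toy model of task-dependence tracking (tags stand in for addresses)."""

    def __init__(self):
        self.last_writer = {}             # tag -> id of the task that produces it
        self.waiting = defaultdict(set)   # task id -> producer ids not yet finished
        self.consumers = defaultdict(list)
        self.ready = deque()

    def submit(self, task, inputs=(), outputs=()):
        deps = {self.last_writer[t] for t in inputs if t in self.last_writer}
        for t in outputs:
            self.last_writer[t] = task
        if deps:
            self.waiting[task] = deps
            for d in deps:
                self.consumers[d].append(task)
        else:
            self.ready.append(task)

    def finish(self, task):
        for c in self.consumers.pop(task, []):
            self.waiting[c].discard(task)
            if not self.waiting[c]:
                del self.waiting[c]
                self.ready.append(c)

mgr = DependenceManager()
mgr.submit("T1", outputs=["a"])
mgr.submit("T2", inputs=["a"], outputs=["b"])   # waits on T1
mgr.submit("T3", inputs=["a", "b"])             # waits on T1 and T2
while mgr.ready:                                # executes T1, T2, T3 in order
    task = mgr.ready.popleft()
    print("run", task)
    mgr.finish(task)
```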
Citations: 12
Image-Domain Gridding on Graphics Processors
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.68
B. Veenboer, M. Petschow, J. Romein
Realizing the next generation of radio telescopes such as the Square Kilometre Array (SKA) requires both more efficient hardware and algorithms than today's technology provides. The recently introduced image-domain gridding (IDG) algorithm is a novel approach towards solving the most compute-intensive parts of creating sky images: gridding and degridding. It avoids the performance bottlenecks of traditional AW-projection gridding by applying instrumental and environmental corrections in the image domain instead of in the Fourier domain. In this paper, we present the first implementations of this new algorithm for CPUs and Graphics Processing Units (GPUs). A thorough performance analysis, in which we apply a modified roofline analysis, shows that our parallelization approaches and optimizations lead to nearly optimal performance on these architectures. The analysis also indicates that, by leveraging dedicated hardware to evaluate trigonometric functions, GPUs are both much faster and more energy efficient than regular CPUs. This makes IDG on GPUs a candidate for meeting the computational and energy efficiency constraints of future telescopes.
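The paper applies a modified roofline analysis; the sketch below shows the plain roofline model underlying it: attainable performance is the minimum of peak compute and arithmetic intensity times memory bandwidth. The device numbers are placeholders, not those used in the paper.

```python
def roofline(peak_flops, mem_bw, intensity):
    """Attainable flop/s for a kernel with the given arithmetic intensity (flop/byte)."""
    return min(peak_flops, intensity * mem_bw)

PEAK = 4.3e12   # hypothetical GPU single-precision peak, flop/s
BW   = 288e9    # hypothetical device memory bandwidth, byte/s

for ai in (0.5, 4.0, 15.0, 60.0):
    perf = roofline(PEAK, BW, ai)
    bound = "memory" if perf < PEAK else "compute"
    print(f"AI {ai:5.1f} flop/byte -> {perf / 1e12:5.2f} Tflop/s ({bound}-bound)")
```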
Citations: 19
FlexVC: Flexible Virtual Channel Management in Low-Diameter Networks
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.110
Pablo Fuentes, E. Vallejo, R. Beivide, C. Minkenberg, M. Valero
Deadlock avoidance mechanisms for lossless low-distance networks typically increase the virtual channel (VC) index with each hop. This restricts the number of buffer resources depending on the routing mechanism and limits performance through inefficient use. Dynamic buffer organizations increase implementation complexity and provide only small gains in this context, because a significant amount of buffering must be allocated statically to avoid congestion. We introduce FlexVC, a simple buffer management mechanism that permits more flexible use of VCs. It combines statically partitioned buffers, opportunistic routing, and a relaxed distance-based deadlock avoidance policy. FlexVC mitigates head-of-line blocking and reduces memory requirements by up to 50%. Simulation results in a Dragonfly network show reduced congestion and up to 37.8% higher throughput, outperforming more complex dynamic approaches. FlexVC merges different traffic flows in the same buffers, which in some cases makes it more difficult to identify the traffic pattern needed to support nonminimal adaptive routing. An alternative denoted FlexVCminCred improves congestion sensing for adaptive routing by separately tracking packets routed minimally and nonminimally, raising throughput by up to 20.4% with 25% savings in buffer area.
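To make the contrast concrete, the sketch below models only VC eligibility under the two policies the abstract contrasts: the baseline ties a packet at hop h to VC h, while a relaxed distance-based policy allows VC h or any higher-indexed VC, enlarging the usable buffer pool while keeping the VC index non-decreasing. This is an illustration of the policy difference, not FlexVC's full opportunistic-routing mechanism.

```python
NUM_VCS = 4

def eligible_vcs_baseline(hop, num_vcs=NUM_VCS):
    """Strict policy: the packet at hop h may only occupy VC h."""
    return [hop] if hop < num_vcs else []

def eligible_vcs_relaxed(hop, num_vcs=NUM_VCS):
    """Relaxed distance-based policy: VC h or any higher index (still acyclic)."""
    return list(range(hop, num_vcs))

for hop in range(3):
    print(f"hop {hop}: baseline {eligible_vcs_baseline(hop)}, "
          f"relaxed {eligible_vcs_relaxed(hop)}")
```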
Citations: 3
Eliminating Irregularities of Protein Sequence Search on Multicore Architectures
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.120
Jing Zhang, Sanchit Misra, Hao Wang, Wu-chun Feng
Finding regions of local similarity between biological sequences is a fundamental task in computational biology. BLAST is the most widely used tool for this purpose, but it suffers from irregularities due to its heuristic nature. To achieve fast search, recent approaches construct the index from the database instead of the input query. However, database indexing introduces more challenges in the design of the index structure and algorithm, especially for data access through the memory hierarchy on modern multicore processors. In this paper, based on existing heuristic algorithms, we design and develop a database-indexed BLAST with sensitivity identical to query-indexed BLAST (i.e., NCBI-BLAST). We then show that existing BLAST heuristics can result in serious irregularities in database-indexed search. To eliminate these irregularities, we propose muBLASTP, which uses multiple optimizations to improve data locality and parallel efficiency on multicore architectures and multi-node systems. Experiments on a single node demonstrate up to a 5.1-fold speedup over the multi-threaded NCBI BLAST. For inter-node parallelism, we achieve nearly linear scaling on up to 128 nodes and up to an 8.9-fold speedup over mpiBLAST.
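The sketch below illustrates database indexing as the abstract describes it: build a k-mer (seed) index over the protein database once, then look query seeds up in it, rather than indexing each query. Real BLAST seeding also scores neighboring words; this sketch matches exact k-mers only, and the toy sequences are invented.

```python
from collections import defaultdict

def build_index(sequences, k=3):
    """Map each k-mer to its (sequence id, offset) occurrences in the database."""
    index = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((sid, i))
    return index

database = ["MKVLAAGICA", "GICAMKVLTT"]          # toy protein database
index = build_index(database)                    # built once, reused per query

query, k = "AAGICA", 3
for i in range(len(query) - k + 1):              # look each query seed up
    for sid, pos in index.get(query[i:i + k], []):
        print(f"seed {query[i:i + k]} at query {i} hits seq {sid} offset {pos}")
```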
Citations: 6
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-volatile Memory
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.109
Xiang Pan, Anys Bacha, R. Teodorescu
Near-threshold computing is emerging as a promising energy-efficient alternative for power-constrained environments. Unfortunately, aggressive reduction in supply voltage to the near-threshold range, albeit effective, faces a host of challenges. This includes higher relative leakage power and high error rates, particularly in dense SRAM structures such as on-chip caches. This paper presents an architecture that rethinks the cache hierarchy in near-threshold multiprocessors. Our design uses STT-RAM to implement all on-chip caches. STT-RAM has several advantages over SRAM at low voltages including low leakage, high density, and reliability. The design consolidates the private caches of near-threshold cores into shared L1 instruction/data caches organized in clusters. We find that our consolidated cache design can service more than 95% of incoming requests within a single cycle. We demonstrate that eliminating the coherence traffic associated with private caches results in a performance boost of 11%. In addition, we propose a hardware-based core management system that dynamically consolidates virtual cores into variable numbers of physical cores to increase resource efficiency. We demonstrate that this approach can save up to 33% in energy.
Citations: 2
Parallelism and Garbage Collection Aware I/O Scheduler with Improved SSD Performance
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.55
Jiayang Guo, Yimin Hu, Bo Mao, Suzhen Wu
In this paper, we propose PGIS, a parallelism- and garbage-collection-aware I/O scheduler that identifies hot data based on trace characteristics to exploit the channel-level internal parallelism of flash-based storage systems. PGIS not only fully exploits the abundant channel resources in the SSD, but also introduces a hot-data identification mechanism to reduce garbage collection overhead. By dispatching hot read data to different channels, channel-level internal parallelism is fully exploited; by dispatching hot write data to the same physical block, garbage collection overhead is alleviated. Experimental results show that, compared with existing I/O schedulers, PGIS improves response time and garbage collection performance significantly. PGIS reduces garbage collection overhead by up to 30.9% while exploiting channel-level internal parallelism.
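The sketch below illustrates the two dispatch rules in the abstract: stripe hot reads across channels to exploit channel-level parallelism, and co-locate hot writes in the same block so they are invalidated together, reducing garbage-collection cost. The access-count hotness test is a stand-in for PGIS's trace-based identification; the threshold is made up.

```python
from collections import Counter

NUM_CHANNELS = 8
access_count = Counter()          # per-LBA access frequency from the trace

def is_hot(lba, threshold=4):
    return access_count[lba] >= threshold

def dispatch(op, lba):
    access_count[lba] += 1
    if op == "read" and is_hot(lba):
        return f"channel {lba % NUM_CHANNELS}"   # stripe hot reads across channels
    if op == "write" and is_hot(lba):
        return "hot-write block"                 # co-locate hot writes for cheap GC
    return "default queue"

for _ in range(5):
    dispatch("read", 42)                         # make LBA 42 hot
print(dispatch("read", 42))                      # -> channel 2
print(dispatch("write", 42))                     # -> hot-write block
print(dispatch("read", 7))                       # -> default queue (cold)
```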
Citations: 23
Apollo: Reusable Models for Fast, Dynamic Tuning of Input-Dependent Code
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.38
D. Beckingsale, Olga Pearce, I. Laguna, T. Gamblin
Increasing architectural diversity makes performance portability extremely important for parallel simulation codes. Emerging on-node parallelization frameworks such as Kokkos and RAJA decouple the work done in kernels from the parallelization mechanism, allowing a single-source kernel to be tuned for different architectures at compile time. However, computational demands in production applications change at runtime, performance depends on both the architecture and the input problem, and tuning a kernel for one set of inputs may not improve its performance on another. The statically optimized versions need to be chosen dynamically to obtain the best performance. Existing auto-tuning approaches can handle slowly evolving applications effectively, but are too slow to tune highly input-dependent kernels. We developed Apollo, an auto-tuning extension for RAJA that uses pre-trained, reusable models to tune input-dependent code at runtime. Apollo is designed for highly dynamic applications; it generates sufficiently low-overhead code to tune parameters each time a kernel runs, making fast decisions. We apply Apollo to two hydrodynamics benchmarks and to a production multi-physics code, and show that it achieves speedups from 1.2x to 4.8x.
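The sketch below mirrors the high-level flow the abstract describes: before each kernel run, cheap input features are fed to a pre-trained model and the predicted variant is applied. The feature names, policies, and decision rule here are invented for illustration; Apollo's real models are trained offline.

```python
def pretrained_model(features):
    """Stand-in for a trained classifier mapping input features to a code variant."""
    if features["num_elements"] < 10_000:
        return "sequential"
    return "omp_static" if features["is_regular"] else "omp_dynamic"

def run_kernel(data, is_regular):
    features = {"num_elements": len(data), "is_regular": is_regular}
    policy = pretrained_model(features)          # cheap decision on every launch
    print(f"n={len(data):>8}, regular={is_regular} -> {policy}")
    # ... dispatch to the statically compiled variant selected above ...

run_kernel(range(100), True)
run_kernel(range(1_000_000), True)
run_kernel(range(1_000_000), False)
```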
Citations: 19