
2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis: Latest Publications

A novel migration-based NUCA design for Chip Multiprocessors
M. Kandemir, Feihui Li, M. J. Irwin, S. Son
Chip Multiprocessors (CMPs) and Non-Uniform Cache Architectures (NUCAs) represent two emerging trends in computer architecture. Targeting future CMP-based systems with NUCA-type L2 caches, this paper proposes a novel data migration algorithm for parallel applications and evaluates it. The goal of this migration scheme is to determine a suitable location for each data block within a large L2 space at any given point during execution. A unique characteristic of the proposed scheme is that it models the problem of optimal data placement in the L2 cache space as a two-dimensional post office placement problem, presents a practical architectural implementation of this model, and gives a detailed evaluation of the proposed implementation. In our experimental evaluation, we also compare our approach to a previously-proposed NUCA management scheme using applications from the specomp suite, oltp, specjbb, and specweb. These experiments show that our migration approach generates about 35% improvement, on average, in average L2 access latency over the previous migration scheme, and these L2 latency savings translate, on average, to 9.5% improvement in IPC (instructions per cycle). We also observed during our experiments that both the careful initial placement of data (which itself triggers migrations within the L2 space) and subsequent migrations (due to inter-processor data sharing) play an important role in achieving our performance improvements.
{"title":"A novel migration-based NUCA design for Chip Multiprocessors","authors":"M. Kandemir, Feihui Li, M. J. Irwin, S. Son","doi":"10.1109/SC.2008.5216918","DOIUrl":"https://doi.org/10.1109/SC.2008.5216918","url":null,"abstract":"Chip Multiprocessors (CMPs) and Non-Uniform Cache Architectures (NUCAs) represent two emerging trends in computer architecture. Targeting future CMP based systems with NUCA type L2 caches, this paper proposes a novel data migration algorithm for parallel applications and evaluates it. The goal of this migration scheme is to determine a suitable location for each data block within a large L2 space at any given point during execution. A unique characteristic of the proposed scheme is that it models the problem of optimal data placement in the L2 cache space as a two-dimensional post office placement problem, presents a practical architectural implementation of this model, and gives a detailed evaluation of the proposed implementation. In our experimental evaluation, we also compare our approach to a previously-proposed NUCA management scheme using applications from the specomp suite, oltp, specjbb, and specweb. These experiments show that our migration approach generates about 35% improvement, on average, in average L2 access latency over the previous migration scheme, and these L2 latency savings translate, on average, to 9.5% improvement in IPC (instructions per cycle).We also observed during our experiments that both the careful initial placement of data (which itself triggers migrations within the L2 space) and subsequent migrations (due to inter-processor data sharing) play an important role in achieving our performance improvements.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121809251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 67
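The placement decision at the heart of the abstract above can be illustrated with a small sketch: pick the L2 bank that minimizes the access-count-weighted Manhattan distance to the cores touching a block, in the spirit of a two-dimensional post office placement problem. The mesh layout, access counts, and cost function below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): choose an L2 bank for a
# data block by minimizing the access-count-weighted Manhattan distance to the
# cores that touch it, in the spirit of a 2-D post office placement problem.

from itertools import product

def best_bank(core_coords, access_counts, mesh_dim):
    """core_coords: {core_id: (x, y)}; access_counts: {core_id: count};
    mesh_dim: (rows, cols) of the banked L2 mesh. Returns the bank (x, y)
    with the lowest access-weighted Manhattan distance to the accessing cores."""
    def cost(bank):
        bx, by = bank
        return sum(access_counts[c] * (abs(bx - x) + abs(by - y))
                   for c, (x, y) in core_coords.items())
    return min(product(range(mesh_dim[0]), range(mesh_dim[1])), key=cost)

# A block accessed mostly by the core at (0, 0) and occasionally by (3, 3).
cores = {0: (0, 0), 1: (3, 3)}
counts = {0: 80, 1: 20}
print(best_bank(cores, counts, (4, 4)))   # the block migrates toward the heavy accessor
```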
Positivity, posynomials and tile size selection
Lakshminarayanan Renganarayanan, S. Rajopadhye
Tiling is a widely used loop transformation for exposing/exploiting parallelism and data locality. Effective use of tiling requires selection and tuning of the tile sizes. This is usually achieved by developing cost models that characterize the performance of the tiled program as a function of tile sizes. All previous approaches to tile size selection (TSS) are cost model specific. Due to this they are neither extensible (e.g., to richer program classes/newer architectures) nor scalable (e.g., to multiple levels of tiling). This paper identifies positivity as a fundamental property shared by the functions and parameters commonly used in TSS models. We show how this positivity can be used as a basis to derive a TSS framework which is both efficient and scalable. We also show that almost all TSS models proposed in the literature (including those used in production compilers and auto-tuners) can be reduced to our framework.
{"title":"Positivity, posynomials and tile size selection","authors":"Lakshminarayanan Renganarayanan, S. Rajopadhye","doi":"10.1145/1413370.1413426","DOIUrl":"https://doi.org/10.1145/1413370.1413426","url":null,"abstract":"Tiling is a widely used loop transformation for exposing/exploiting parallelism and data locality. Effective use of tiling requires selection and tuning of the tile sizes. This is usually achieved by developing cost models that characterize the performance of the tiled program as a function of tile sizes. All previous approaches to tile size selection (TSS) are cost model specific. Due to this they are neither extensible (e.g., to richer program classes/newer architectures) nor scalable (e.g., to multiple levels of tiling). This paper identifies positivity as a fundamental property shared by the functions and parameters commonly used in TSS models. We show how this positivity can be used as a basis to derive a TSS framework which is both efficient and scalable. We also show that almost all TSS models proposed in the literature (including those used in production compilers and auto-tuners) can be reduced to our framework.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130197085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31
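A minimal sketch of what a posynomial tile-size cost model looks like (a sum of monomial terms with positive coefficients) and how tile sizes might be selected from it under a capacity constraint; the specific cost terms and the constraint are hypothetical, not the models the paper reduces to its framework.

```python
# Hypothetical posynomial cost model in the tile sizes plus a brute-force
# search for the best tile sizes under a cache-capacity constraint.

def tile_cost(t1, t2):
    # per-tile overhead amortized over the tile's iterations (shrinks with size)
    overhead = 1.0e6 / (t1 * t2)
    # working-set and conflict pressure (grows with each tile extent)
    pressure = 2.0 * t1 + 3.0 * t2
    return overhead + pressure            # all coefficients positive: a posynomial

def select_tiles(capacity=4096, sizes=range(8, 257, 8)):
    feasible = [(t1, t2) for t1 in sizes for t2 in sizes if t1 * t2 <= capacity]
    return min(feasible, key=lambda t: tile_cost(*t))

best = select_tiles()
print(best, tile_cost(*best))
```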
Early evaluation of IBM BlueGene/P
S. Alam, R. Barrett, M. Bast, M. Fahey, J. Kuehn, Collin McCurdy, James H. Rogers, P. Roth, R. Sankaran, J. Vetter, P. Worley, Weikuan Yu
BlueGene/P (BG/P) is the second generation BlueGene architecture from IBM, succeeding BlueGene/L (BG/L). BG/P is a system-on-a-chip (SoC) design that uses four PowerPC 450 cores operating at 850 MHz with a double precision, dual pipe floating point unit per core. These chips are connected with multiple interconnection networks including a 3-D torus, a global collective network, and a global barrier network. The design is intended to provide a highly scalable, physically dense system with relatively low power requirements per flop. In this paper, we report on our examination of BG/P, presented in the context of a set of important scientific applications, and as compared to other major large scale supercomputers in use today. Our investigation confirms that BG/P has good scalability with an expected lower performance per processor when compared to the Cray XT4's Opteron. We also find that BG/P uses very low power per floating point operation for certain kernels, yet it has less of a power advantage when considering science driven metrics for mission applications.
{"title":"Early evaluation of IBM BlueGene/P","authors":"S. Alam, R. Barrett, M. Bast, M. Fahey, J. Kuehn, Collin McCurdy, James H. Rogers, P. Roth, R. Sankaran, J. Vetter, P. Worley, Weikuan Yu","doi":"10.1109/SC.2008.5214725","DOIUrl":"https://doi.org/10.1109/SC.2008.5214725","url":null,"abstract":"BlueGene/P (BG/P) is the second generation BlueGene architecture from IBM, succeeding BlueGene/L (BG/L). BG/P is a system-on-a-chip (SoC) design that uses four PowerPC 450 cores operating at 850 MHz with a double precision, dual pipe floating point unit per core. These chips are connected with multiple interconnection networks including a 3-D torus, a global collective network, and a global barrier network. The design is intended to provide a highly scalable, physically dense system with relatively low power requirements per flop. In this paper, we report on our examination of BG/P, presented in the context of a set of important scientific applications, and as compared to other major large scale supercomputers in use today. Our investigation confirms that BG/P has good scalability with an expected lower performance per processor when compared to the Cray XT4's Opteron. We also find that BG/P uses very low power per floating point operation for certain kernels, yet it has less of a power advantage when considering science driven metrics for mission applications.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129817376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 96
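As a back-of-the-envelope check of the node description above, the per-node peak follows from the quoted core count and clock, assuming the dual-pipe double-precision FPU retires two fused multiply-adds (4 flops) per core per cycle; that flops-per-cycle figure is an assumption, not stated in the abstract.

```python
# Per-node peak for BG/P derived from the figures in the abstract, assuming
# 2 FMAs (4 flops) per core per cycle.
cores_per_node = 4
clock_hz = 850e6
flops_per_cycle = 4                      # assumption, not stated above

peak_gflops = cores_per_node * clock_hz * flops_per_cycle / 1e9
print(f"{peak_gflops:.1f} GFlop/s per node")   # ~13.6 GFlop/s
```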
Prefetch throttling and data pinning for improving performance of shared caches
O. Ozturk, S. Son, M. Kandemir, Mustafa Karaköy
In this paper, we (i) quantify the impact of compiler-directed I/O prefetching on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings some benefits, its effectiveness reduces significantly as the number of clients (compute nodes) is increased; (ii) identify inter-client misses due to harmful I/O prefetches as one of the main sources for this reduction in performance with increased number of clients; and (iii) propose and experimentally evaluate prefetch throttling and data pinning schemes to improve performance of I/O prefetching. Prefetch throttling prevents one or more clients from issuing further prefetches if such prefetches are predicted to be harmful, i.e., replace from the memory cache the useful data accessed by other clients. Data pinning on the other hand makes selected data blocks immune to harmful prefetches by pinning them in the memory cache. We show that these two schemes can be applied in isolation or combined together, and they can be applied at a coarse or fine granularity. Our experiments with these two optimizations using four disk-intensive applications reveal that they can improve performance by 9.7% and 15.1% on average, over standard compiler-directed I/O prefetching and no-prefetch case, respectively, when 8 clients are used.
{"title":"Prefetch throttling and data pinning for improving performance of shared caches","authors":"O. Ozturk, S. Son, M. Kandemir, Mustafa Karaköy","doi":"10.1145/1413370.1413430","DOIUrl":"https://doi.org/10.1145/1413370.1413430","url":null,"abstract":"In this paper, we (i) quantify the impact of compiler-directed I/O prefetching on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings some benefits, its effectiveness reduces significantly as the number of clients (compute nodes) is increased; (ii) identify inter-client misses due to harmful I/O prefetches as one of the main sources for this reduction in performance with increased number of clients; and (iii) propose and experimentally evaluate prefetch throttling and data pinning schemes to improve performance of I/O prefetching. Prefetch throttling prevents one or more clients from issuing further prefetches if such prefetches are predicted to be harmful, i.e., replace from the memory cache the useful data accessed by other clients. Data pinning on the other hand makes selected data blocks immune to harmful prefetches by pinning them in the memory cache. We show that these two schemes can be applied in isolation or combined together, and they can be applied at a coarse or fine granularity. Our experiments with these two optimizations using four disk-intensive applications reveal that they can improve performance by 9.7% and 15.1% on average, over standard compiler-directed I/O prefetching and no-prefetch case, respectively, when 8 clients are used.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115988862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
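A toy sketch of the two ideas in the abstract above: a shared cache that stops honoring prefetches from a client whose prefetches keep displacing other clients' blocks, and that exempts pinned blocks from replacement. The bookkeeping and the harm threshold are illustrative, not the compiler-directed scheme evaluated in the paper.

```python
# Toy shared cache with prefetch throttling and data pinning. Thresholds,
# structures, and the "harmful prefetch" bookkeeping are illustrative only.

from collections import OrderedDict

class SharedCache:
    def __init__(self, capacity, harm_threshold=0.5):
        self.capacity = capacity
        self.harm_threshold = harm_threshold
        self.blocks = OrderedDict()   # block -> (owner, was_prefetched, pinned)
        self.stats = {}               # client -> [harmful_prefetches, issued_prefetches]

    def prefetch(self, client, block):
        harmful, issued = self.stats.setdefault(client, [0, 0])
        if issued and harmful / issued > self.harm_threshold:
            return False              # throttled: this client's prefetches look harmful
        self.stats[client][1] += 1
        self._install(client, block, prefetched=True)
        return True

    def access(self, client, block, pin=False):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # LRU hit
            return True
        self._install(client, block, prefetched=False, pin=pin)
        return False

    def _install(self, client, block, prefetched, pin=False):
        if block in self.blocks:
            return
        while len(self.blocks) >= self.capacity:
            victim = next((b for b, (_, _, p) in self.blocks.items() if not p), None)
            if victim is None:
                break                 # everything pinned: over-commit in this toy model
            owner, _, _ = self.blocks.pop(victim)
            # a prefetch that displaces another client's block counts as harmful
            if prefetched and owner != client:
                self.stats.setdefault(client, [0, 0])[0] += 1
        self.blocks[block] = (client, prefetched, pin)

# Client 0 streams prefetches; client 1's pinned block survives, and client 0
# gets throttled once its prefetches start evicting client 1's data.
cache = SharedCache(capacity=4)
cache.access(1, "hot", pin=True)
for b in ("warm1", "warm2", "warm3"):
    cache.access(1, b)
issued = sum(cache.prefetch(0, f"p{i}") for i in range(10))
print("hot" in cache.blocks, issued, cache.stats)
```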
Communication Avoiding Gaussian elimination
L. Grigori, J. Demmel, Hua Xiang
We present CALU, a Communication Avoiding algorithm for the LU factorization of dense matrices distributed in a two-dimensional cyclic layout. The algorithm is based on a new pivoting strategy, which is stable in practice. The new algorithm is optimal (up to polylogarithmic factors) in the amount of communication it performs. Our experiments show that CALU leads to a reduction in the parallel time, in particular when the latency time is an important factor of the overall time. The factorization of a block-column, a subroutine of CALU, outperforms the corresponding routine PDGETF2 from ScaLAPACK up to a factor of 4.37 on an IBM POWER 5 system and up to a factor of 5.58 on a Cray XT4 system. On square matrices of order 10^4, CALU outperforms the corresponding routine PDGETRF from ScaLAPACK by a factor of 1.24 on IBM POWER 5 and by a factor of 1.31 on Cray XT4.
{"title":"Communication Avoiding Gaussian elimination","authors":"L. Grigori, J. Demmel, Hua Xiang","doi":"10.1109/SC.2008.5214287","DOIUrl":"https://doi.org/10.1109/SC.2008.5214287","url":null,"abstract":"We present CALU, a Communication Avoiding algorithm for the LU factorization of dense matrices distributed in a two-dimensional cyclic layout. The algorithm is based on a new pivoting strategy, which is stable in practice. The new algorithm is optimal (up to polylogarithmic factors) in the amount of communication it performs. Our experiments show that CALU leads to a reduction in the parallel time, in particular when the latency time is an important factor of the overall time. The factorization of a block-column, a subroutine of CALU, outperforms the corresponding routine PDGETF2 from ScaLAPACK up to a factor of 4.37 on an IBM POWER 5 system and up to a factor of 5.58 on a Cray XT4 system. On square matrices of order 104, CALU outperforms the corresponding routine PDGETRF from ScaLAPACK by a factor of 1.24 on IBM POWER 5 and by a factor of 1.31 on Cray XT4.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"423 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117351230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 67
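CALU's panel factorization selects pivots through a tournament-style reduction over block rows (tournament pivoting): each "processor" nominates candidate pivot rows from its local chunk via partial pivoting, and candidates are merged pairwise until b pivot rows remain. The sketch below emulates that reduction serially with NumPy; it illustrates the pivoting idea only and is not the CALU or ScaLAPACK code.

```python
# Serial emulation of tournament-style pivot selection for a tall block column.

import numpy as np

def candidate_rows(block, b):
    """Return indices (into block's rows) of b pivot candidates chosen by
    Gaussian elimination with partial pivoting on a copy of the block."""
    A = block.astype(float).copy()
    m = A.shape[0]
    perm = list(range(m))
    for k in range(min(b, m)):
        p = k + np.argmax(np.abs(A[k:, k]))
        A[[k, p]] = A[[p, k]]
        perm[k], perm[p] = perm[p], perm[k]
        if A[k, k] != 0:
            A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return perm[:b]

def tournament_pivots(panel, b, nprocs):
    chunks = np.array_split(np.arange(panel.shape[0]), nprocs)
    # leaves of the tournament: each processor's local candidates (global indices)
    survivors = [chunk[candidate_rows(panel[chunk], b)] for chunk in chunks]
    while len(survivors) > 1:                      # pairwise reduction rounds
        merged = []
        for i in range(0, len(survivors), 2):
            rows = np.concatenate(survivors[i:i+2])
            merged.append(rows[candidate_rows(panel[rows], b)])
        survivors = merged
    return survivors[0]                            # b global pivot row indices

panel = np.random.default_rng(0).standard_normal((64, 4))
print(tournament_pivots(panel, b=4, nprocs=8))
```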
An efficient parallel approach for identifying protein families in large-scale metagenomic data sets
Changjun Wu, A. Kalyanaraman
Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data that, until a couple of years ago, was deemed impractical to generate. A primary bottleneck, however, is the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily-sized dense subgraphs from bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results of extensively testing our implementation on 160 K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.
{"title":"An efficient parallel approach for identifying protein families in large-scale metagenomic data sets","authors":"Changjun Wu, A. Kalyanaraman","doi":"10.1145/1413370.1413406","DOIUrl":"https://doi.org/10.1145/1413370.1413406","url":null,"abstract":"Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data that was until a couple of years back was deemed impractical for generation. A primary bottleneck, however, is in the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences we reduce the problem to one of detecting arbitrarily-sized dense subgraphs from bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results of extensively testing our implementation on 160 K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124254337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
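The reduction described above targets dense subgraphs of a bipartite graph (e.g., peptides on one side, shared features on the other). The sketch below shows a serial greedy "peeling" heuristic for dense-subgraph detection on such a graph; the paper's parallel divide-and-conquer and pattern-matching machinery is not reproduced, and the density measure used here (edges per vertex) is an illustrative choice.

```python
# Serial greedy peeling heuristic: repeatedly remove the minimum-degree vertex
# and remember the densest intermediate subgraph seen along the way.

def densest_subgraph(edges):
    """edges: iterable of (u, v) pairs; vertices are arbitrary hashables.
    Returns (vertex_set, density) where density = |E| / |V|."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    remaining = dict(adj)
    num_edges = sum(len(n) for n in remaining.values()) // 2
    best, best_density = set(remaining), num_edges / max(len(remaining), 1)
    while len(remaining) > 1:
        victim = min(remaining, key=lambda x: len(remaining[x]))
        num_edges -= len(remaining[victim])
        for nbr in remaining.pop(victim):
            remaining[nbr].discard(victim)
        density = num_edges / len(remaining)
        if density > best_density:
            best, best_density = set(remaining), density
    return best, best_density

# Two peptides sharing many sequence features form the dense core.
edges = [("pepA", "f1"), ("pepA", "f2"), ("pepA", "f3"),
         ("pepB", "f1"), ("pepB", "f2"), ("pepB", "f3"), ("pepC", "f3")]
print(densest_subgraph(edges))
```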
369 Tflop/s molecular dynamics simulations on the Roadrunner general-purpose heterogeneous supercomputer
S. Swaminarayan, K. Kadau, T. Germann, G. Fossum
We present timing and performance numbers for a short-range parallel molecular dynamics (MD) code, SPaSM, that has been rewritten for the heterogeneous Roadrunner supercomputer. Each Roadrunner compute node consists of two AMD Opteron dualcore microprocessors and four PowerXCell 8i enhanced Cell microprocessors, so that there are four MPI ranks per node, each with one Opteron and one Cell. The interatomic forces are computed on the Cells (each with one PPU and eight SPU cores), while the Opterons are used to direct inter-rank communication and perform I/O-heavy periodic analysis, visualization, and checkpointing tasks. The performance measured for our initial implementation of a standard Lennard-Jones pair potential benchmark reached a peak of 369 Tflop/s double-precision floating-point performance on the full Roadrunner system (27.7% of peak), corresponding to 124 MFlop/Watt/s at a price of approximately 3.69 MFlops/dollar. We demonstrate an initial target application, the jetting and ejection of material from a shocked surface.
{"title":"369 Tflop/s molecular dynamics simulations on the Roadrunner general-purpose heterogeneous supercomputer","authors":"S. Swaminarayan, K. Kadau, T. Germann, G. Fossum","doi":"10.1145/1413370.1413436","DOIUrl":"https://doi.org/10.1145/1413370.1413436","url":null,"abstract":"We present timing and performance numbers for a short-range parallel molecular dynamics (MD) code, SPaSM, that has been rewritten for the heterogeneous Roadrunner supercomputer. Each Roadrunner compute node consists of two AMD Opteron dualcore microprocessors and four PowerXCell 8i enhanced Cell microprocessors, so that there are four MPI ranks per node, each with one Opteron and one Cell. The interatomic forces are computed on the Cells (each with one PPU and eight SPU cores), while the Opterons are used to direct inter-rank communication and perform I/O-heavy periodic analysis, visualization, and checkpointing tasks. The performance measured for our initial implementation of a standard Lennard-Jones pair potential benchmark reached a peak of 369 Tflop/s double-precision floating-point performance on the full Roadrunner system (27.7% of peak), corresponding to 124 MFlop/Watt/s at a price of approximately 3.69 MFlops/dollar. We demonstrate an initial target application, the jetting and ejection of material from a shocked surface.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"2581 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128796673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31
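The benchmark quoted above is the standard 12-6 Lennard-Jones pair potential; a minimal serial force kernel for it is sketched below. SPaSM's cell lists, domain decomposition, and the Opteron/Cell work split are not modeled; this only shows the pair arithmetic the SPUs would spend their flops on.

```python
# Minimal serial Lennard-Jones pair-force kernel (O(N^2), no cell lists).

import numpy as np

def lj_forces(pos, epsilon=1.0, sigma=1.0, rcut=2.5):
    """pos: (N, 3) array of positions. Returns (N, 3) forces and the total
    potential energy for the standard 12-6 Lennard-Jones potential."""
    n = len(pos)
    forces = np.zeros_like(pos)
    energy = 0.0
    for i in range(n - 1):
        dr = pos[i] - pos[i + 1:]                  # vectors to all later particles
        r2 = np.einsum("ij,ij->i", dr, dr)
        mask = r2 < rcut * rcut
        s6 = (sigma * sigma / r2[mask]) ** 3
        s12 = s6 * s6
        energy += np.sum(4.0 * epsilon * (s12 - s6))
        # force magnitude expressed through r^2: F = 24*eps*(2*s12 - s6)/r^2 * dr
        coeff = 24.0 * epsilon * (2.0 * s12 - s6) / r2[mask]
        fij = coeff[:, None] * dr[mask]
        forces[i] += fij.sum(axis=0)
        forces[i + 1:][mask] -= fij                # Newton's third law
    return forces, energy

pos = np.random.default_rng(1).uniform(0.0, 5.0, size=(64, 3))
f, e = lj_forces(pos)
print(e, np.abs(f.sum(axis=0)).max())              # net force ~ 0
```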
Massively parallel volume rendering using 2–3 swap image compositing
Hongfeng Yu, Chaoli Wang, K. Ma
The ever-increasing amounts of simulation data produced by scientists demand high-end parallel visualization capability. However, image compositing, which requires interprocessor communication, is often the bottleneck stage for parallel rendering of large volume data sets. Existing image compositing solutions either incur a large number of messages exchanged among processors (such as the direct send method), or limit the number of processors that can be effectively utilized (such as the binary swap method). We introduce a new image compositing algorithm, called 2-3 swap, which combines the flexibility of the direct send method and the optimality of the binary swap method. The 2-3 swap algorithm allows an arbitrary number of processors to be used for compositing, and fully utilizes all participating processors throughout the course of the compositing. We experiment with this image compositing solution on a supercomputer with thousands of processors, and demonstrate its great flexibility as well as scalability.
{"title":"Massively parallel volume rendering using 2–3 swap image compositing","authors":"Hongfeng Yu, Chaoli Wang, K. Ma","doi":"10.1145/1508044.1508084","DOIUrl":"https://doi.org/10.1145/1508044.1508084","url":null,"abstract":"The ever-increasing amounts of simulation data produced by scientists demand high-end parallel visualization capability. However, image compositing, which requires interprocessor communication, is often the bottleneck stage for parallel rendering of large volume data sets. Existing image compositing solutions either incur a large number of messages exchanged among processors (such as the direct send method), or limit the number of processors that can be effectively utilized (such as the binary swap method). We introduce a new image compositing algorithm, called 2-3 swap, which combines the flexibility of the direct send method and the optimality of the binary swap method. The 2-3 swap algorithm allows an arbitrary number of processors to be used for compositing, and fully utilizes all participating processors throughout the course of the compositing. We experiment with this image compositing solution on a supercomputer with thousands of processors, and demonstrate its great flexibility as well as scalability.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122778720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 108
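The flexibility claimed above rests on the fact that any processor count n >= 2 can be partitioned into groups of size 2 or 3, so every processor takes part in each compositing round (unlike binary swap, which needs a power of two). The sketch below shows one such grouping rule; the actual image partitioning and exchange are not modeled, and the specific grouping policy is an illustrative choice.

```python
def two_three_groups(ranks):
    """Partition a list of ranks into groups of size 2 or 3 (requires n >= 2)."""
    n = len(ranks)
    if n < 2:
        raise ValueError("need at least two processors to composite")
    groups, i = [], 0
    while i < n:
        remaining = n - i
        if remaining == 4:        # 4 must split as 2 + 2 (a 3 would strand one rank)
            size = 2
        elif remaining % 2:       # odd remainder: take a 3 to make the rest even
            size = 3
        else:
            size = 2
        groups.append(ranks[i:i + size])
        i += size
    return groups

for n in (2, 3, 5, 7, 12, 13):
    print(n, two_three_groups(list(range(n))))
```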
Parallel I/O prefetching using MPI file caching and I/O signatures
S. Byna, Yong Chen, Xian-He Sun, R. Thakur, W. Gropp
Parallel I/O prefetching is considered to be effective in improving I/O performance. However, the effectiveness depends on determining patterns among future I/O accesses swiftly and fetching data in time, which is difficult to achieve in general. In this study, we propose an I/O signature-based prefetching strategy. The idea is to use a predetermined I/O signature of an application to guide prefetching. To put this idea to work, we first derived a classification of patterns and introduced a simple and effective signature notation to represent patterns. We then developed a toolkit to trace and generate I/O signatures automatically. Finally, we designed and implemented a thread-based client-side collective prefetching cache layer for MPI-IO library to support prefetching. A prefetching thread reads I/O signatures of an application and adjusts them by observing I/O accesses at runtime. Experimental results show that the proposed prefetching method improves I/O performance significantly for applications with complex patterns.
{"title":"Parallel I/O prefetching using MPI file caching and I/O signatures","authors":"S. Byna, Yong Chen, Xian-He Sun, R. Thakur, W. Gropp","doi":"10.1109/SC.2008.5213604","DOIUrl":"https://doi.org/10.1109/SC.2008.5213604","url":null,"abstract":"Parallel I/O prefetching is considered to be effective in improving I/O performance. However, the effectiveness depends on determining patterns among future I/O accesses swiftly and fetching data in time, which is difficult to achieve in general. In this study, we propose an I/O signature-based prefetching strategy. The idea is to use a predetermined I/O signature of an application to guide prefetching. To put this idea to work, we first derived a classification of patterns and introduced a simple and effective signature notation to represent patterns. We then developed a toolkit to trace and generate I/O signatures automatically. Finally, we designed and implemented a thread-based client-side collective prefetching cache layer for MPI-IO library to support prefetching. A prefetching thread reads I/O signatures of an application and adjusts them by observing I/O accesses at runtime. Experimental results show that the proposed prefetching method improves I/O performance significantly for applications with complex patterns.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126327793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 144
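A minimal sketch of signature-guided prefetching: infer a constant-stride {offset, stride, size} signature from recent requests and predict the next few offsets to fetch ahead. The signature notation, trace format, and prefetch depth here are illustrative and do not reproduce the paper's toolkit or its runtime adjustment.

```python
# Illustrative signature inference and prefetch planning for strided I/O.

def infer_signature(offsets, sizes):
    """Return (stride, size) if the trace is a constant-stride pattern, else None."""
    strides = {b - a for a, b in zip(offsets, offsets[1:])}
    if len(strides) == 1 and len(set(sizes)) == 1:
        return strides.pop(), sizes[0]
    return None

def prefetch_plan(offsets, sizes, depth=4):
    """Predict the next `depth` (offset, size) requests from the signature."""
    sig = infer_signature(offsets, sizes)
    if sig is None:
        return []                      # no recognizable pattern: do not prefetch
    stride, size = sig
    last = offsets[-1]
    return [(last + stride * k, size) for k in range(1, depth + 1)]

# A process reading 1 MiB records with a 4 MiB stride (e.g., one of four clients).
trace_offsets = [0, 4 << 20, 8 << 20, 12 << 20]
trace_sizes = [1 << 20] * 4
print(prefetch_plan(trace_offsets, trace_sizes))
```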
BitDew: A programmable environment for large-scale data management and distribution
G. Fedak, Haiwu He, F. Cappello
Desktop Grids use the computing, network and storage resources from idle desktop PCs distributed over multiple LANs or the Internet to compute a large variety of resource-demanding distributed applications. While these applications need to access, compute, store and circulate large volumes of data, little attention has been paid to data management in such large-scale, dynamic, heterogeneous, volatile and highly distributed Grids. In most cases, data management relies on ad-hoc solutions, and providing a general approach is still a challenging issue. To address this problem, we propose the BitDew framework, a programmable environment for automatic and transparent data management on computational Desktop Grids. This paper describes the BitDew programming interface, its architecture, and the performance evaluation of its runtime components. BitDew relies on a specific set of meta-data to drive key data management operations, namely life cycle, distribution, placement, replication and fault tolerance, with a high level of abstraction. The BitDew runtime environment is a flexible distributed service architecture that integrates modular P2P components such as DHTs for a distributed data catalog and collaborative transport protocols for data distribution. Through several examples, we describe how application programmers and BitDew users can exploit BitDew's features. The performance evaluation demonstrates that the high level of abstraction and transparency is obtained with a reasonable overhead, while offering the benefit of scalability, performance and fault tolerance with little programming cost.
{"title":"BitDew: A programmable environment for large-scale data management and distribution","authors":"G. Fedak, Haiwu He, F. Cappello","doi":"10.1109/SC.2008.5213939","DOIUrl":"https://doi.org/10.1109/SC.2008.5213939","url":null,"abstract":"Desktop Grids use the computing, network and storage resources from idle desktop PC's distributed over multiple-LAN's or the Internet to compute a large variety of resource-demanding distributed applications. While these applications need to access, compute, store and circulate large volumes of data, little attention has been paid to data management in such large-scale, dynamic, heterogeneous, volatile and highly distributed Grids. In most cases, data management relies on ad-hoc solutions, and providing a general approach is still a challenging issue. To address this problem, we propose the BitDew framework, a programmable environment for automatic and transparent data management on computational Desktop Grids. This paper describes the BitDew programming interface, its architecture, and the performance evaluation of its runtime components. BitDew relies on a specific set of meta-data to drive key data management operations, namely life cycle, distribution, placement, replication and fault-tolerance with a high level of abstraction. The Bitdew runtime environment is a flexible distributed service architecture that integrates modular P2P components such as DHT's for a distributed data catalog and collaborative transport protocols for data distribution. Through several examples, we describe how application programmers and Bitdew users can exploit Bitdew's features. The performance evaluation demonstrates that the high level of abstraction and transparency is obtained with a reasonable overhead, while offering the benefit of scalability, performance and fault tolerance with little programming cost.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129211017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
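A sketch of attribute-driven data management in the spirit described above: each data item carries metadata (desired replica count, resilience, lifetime) and a scheduler reconciles the observed placement with those attributes. The attribute names, data structures, and scheduler below are hypothetical and are not the BitDew API.

```python
# Hypothetical attribute-driven reconciliation loop (not the BitDew interface).

import time
from dataclasses import dataclass, field

@dataclass
class DataItem:
    name: str
    replica: int = 1                 # desired number of copies
    resilient: bool = False          # re-replicate when a host disappears
    lifetime: float = 3600.0         # seconds before the item may be reclaimed
    created: float = field(default_factory=time.time)
    hosts: set = field(default_factory=set)

def reconcile(items, live_hosts):
    """Return (placements, deletions) needed to honor each item's attributes."""
    placements, deletions = [], []
    for item in items:
        if time.time() - item.created > item.lifetime:
            deletions.append(item.name)             # expired: reclaim everywhere
            continue
        item.hosts &= live_hosts                    # drop copies on vanished hosts
        missing = item.replica - len(item.hosts)
        if missing > 0 and (item.resilient or not item.hosts):
            targets = sorted(live_hosts - item.hosts)[:missing]
            placements += [(item.name, h) for h in targets]
    return placements, deletions

items = [DataItem("inputs.tar", replica=3, resilient=True),
         DataItem("scratch.bin", replica=1, lifetime=0.0)]
print(reconcile(items, live_hosts={"hostA", "hostB", "hostC"}))
```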