
Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation: Latest Publications

Packing/unpacking information generation for efficient generalized kr→r and r→kr array redistribution
Ching-Hsien Hsu, Yeh-Ching Chung, C. Dow
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient methods to generate the packing/unpacking information for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution with arbitrary source/destination processor sets. The most significant improvement of this paper is that a processor does not need to construct the send/receive data sets for a redistribution. Based on the packing/unpacking information derived from kr→r and r→kr redistributions, a processor can pack/unpack array elements into (from) messages directly. To evaluate the performance of our methods, we have implemented them along with the PITFALLS method and Prylli's method on an IBM SP2 parallel machine. The experimental results show that our algorithms outperform the PITFALLS method and Prylli's method for all test samples.
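As a rough illustration of the index mapping such a redistribution must realize, the owner of each global index under a BLOCK-CYCLIC(b) distribution can be enumerated directly. This is not the paper's algorithm (whose point is generating the packing/unpacking information without building these send/receive sets explicitly); it is a naive sketch of the problem being solved:

```python
# Naive enumeration of the mapping a BLOCK-CYCLIC(k*r) -> BLOCK-CYCLIC(r)
# redistribution must realize (illustration only; the paper avoids
# constructing these sets explicitly).

def owner(i, b, p):
    """Processor owning global index i under BLOCK-CYCLIC(b) on p processors."""
    return (i // b) % p

def redistribution_pairs(n, k, r, p):
    """(source, destination) processor for every global index of an n-element array."""
    return [(owner(i, k * r, p), owner(i, r, p)) for i in range(n)]

pairs = redistribution_pairs(n=16, k=2, r=2, p=2)
```

Each pair where source and destination differ corresponds to an element that must be packed into a message, which is the information the paper's methods derive in closed form.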
Citations: 0
Efficient VLSI layouts of hypercubic networks
C. Yeh, Emmanouel Varvarigos, B. Parhami
In this paper we present efficient VLSI layouts of several hypercubic networks. We show that an N-node hypercube and an N-node cube-connected cycles (CCC) graph can be laid out in 4N^2/9 + o(N^2) and 4N^2/(9 log_2^2 N) + o(N^2/log^2 N) areas, respectively, both of which are optimal within a factor of 1.7 + o(1). We introduce the multilayer grid model, and present efficient layouts of hypercubes that use more than 2 layers of wires. We derive efficient layouts for butterfly networks, generalized hypercubes, hierarchical swapped networks, and indirect swapped networks that are optimal within a factor of 1 + o(1). We also present efficient layouts for folded hypercubes, reduced hypercubes, recursive hierarchical swapped networks, and enhanced-cubes, which are the best results reported for these networks thus far.
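The leading-order area bounds can be compared numerically; the sketch below is illustrative only, using the dominant terms from the abstract and ignoring the o(·) corrections:

```python
import math

def hypercube_area(n):
    """Leading-order layout area 4N^2/9 for an N-node hypercube."""
    return 4 * n ** 2 / 9

def ccc_area(n):
    """Leading-order layout area 4N^2/(9 log_2^2 N) for an N-node CCC graph."""
    return 4 * n ** 2 / (9 * math.log2(n) ** 2)

# for equal node counts, the CCC layout is smaller by a log^2 N factor
ratio = hypercube_area(1024) / ccc_area(1024)  # 100.0
```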
Citations: 23
Java for numerically intensive computing: from flops to gigaflops
S. Midkiff, J. Moreira, M. Snir
Java is not thought of as being competitive with Fortran for numerical programming. In this paper, we discuss technologies that can and will deliver Fortran-like performance in Java. These techniques include new and existing compiler technologies, the exploitation of parallelism, and a collection of Java libraries for numerical computing. We also present experimental data to show the effectiveness of our approaches. In particular we achieve 1 Gflops with a linear algebra kernel on an RS/6000 SMP machine. Most of these techniques require no language changes; a few depend on extensions to Java currently under consideration.
Citations: 8
A framework for generating task parallel programs
U. Fissgus, T. Rauber, G. Runger
We consider the generation of mixed task and data parallel programs and discuss how a clear separation into a task and a data parallel level can support the development of efficient programs. The program development starts with a specification of the maximum degree of task and data parallelism and proceeds by performing several derivation steps in which the degree of parallelism is adapted to a specific parallel machine. We show how the final message-passing programs are generated and how the interaction between the task and data parallel levels can be established. We demonstrate the usefulness of the approach by examples from numerical analysis which offer the potential of a mixed task and data parallel execution but for which it is not a priori clear how this potential should be exploited in an implementation on a specific parallel machine.
Citations: 5
A recursive PVM implementation of an image segmentation algorithm with performance results comparing the HIVE and the Cray T3E
J. Tilton
A recursive PVM (Parallel Virtual Machine) implementation of a high quality but computationally intensive image segmentation approach is described and the performance of the algorithm on the HIVE and on the Cray T3E is contrasted. The image segmentation algorithm, which is designed for the analysis of multispectral or hyperspectral remotely sensed imagery data, is a hybrid of region growing and spectral clustering that produces a hierarchical set of image segmentations based on detected natural convergence points. The HIVE is a Beowulf-class parallel computer consisting of 66 Pentium Pro PCs (64 slaves and 2 controllers) with 2 processors per PC (for 128 total slave processors) which was developed and assembled by the Applied Information Sciences Branch at NASA's Goddard Space Flight Center. The Cray T3E is a supercomputer with 512 available processors, which is installed at the NASA Center for Computational Science at NASA's Goddard Space Flight Center. Timing results on Landsat Multispectral Scanner data show that the algorithm runs approximately 1.5 times faster on the HIVE, even though the HIVE is some 86 times less costly than the Cray T3E.
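A minimal sketch of the hierarchical-merge idea behind such region-growing segmentation, on 1-D data with mean intensity as an assumed similarity criterion; the paper's actual algorithm additionally incorporates spectral clustering and operates on multispectral or hyperspectral imagery:

```python
# Toy 1-D hierarchical region growing (illustration only; similarity
# criterion and data are assumptions, not taken from the paper).

def hierarchical_merge(values):
    """Repeatedly merge the most similar adjacent segments, recording
    one segmentation per step: a hierarchy from fine to coarse."""
    segments = [[v] for v in values]
    hierarchy = [[tuple(s) for s in segments]]
    while len(segments) > 1:
        means = [sum(s) / len(s) for s in segments]
        # merge the adjacent pair whose mean intensities differ least
        j = min(range(len(segments) - 1),
                key=lambda i: abs(means[i] - means[i + 1]))
        segments[j:j + 2] = [segments[j] + segments[j + 1]]
        hierarchy.append([tuple(s) for s in segments])
    return hierarchy

levels = hierarchical_merge([1, 1, 9, 9])
```

The detected "natural convergence points" in the paper correspond to levels of such a hierarchy at which further merging would join dissimilar regions.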
Citations: 15
A data-parallel algorithm for iterative tomographic image reconstruction
C. Johnson, A. Sofer
In the tomographic imaging problem, images are reconstructed from a set of measured projections. Iterative reconstruction methods are computationally intensive alternatives to the more traditional Fourier-based methods. Despite their high cost, the popularity of these methods is increasing because of the advantages they offer. Although numerous iterative methods have been proposed over the years, all of these methods can be shown to have a similar computational structure. This paper presents a parallel algorithm that we originally developed for performing the expectation maximization algorithm in emission tomography. This algorithm is capable of exploiting the sparsity and symmetries of the model in a computationally efficient manner. Our parallelization scheme is based upon decomposition of the measurement-space vectors. We demonstrate that such a parallelization scheme is applicable to the vast majority of iterative reconstruction algorithms proposed to date.
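The expectation maximization (MLEM) update underlying the approach has the form x_j ← (x_j/s_j) Σ_i A_ij y_i/(Ax)_i, where the sums over measurement index i run over measurement-space vectors, which is exactly what the decomposition splits across processors. A serial sketch with assumed symbol names, not taken from the paper:

```python
# One MLEM update, written serially for illustration.  The inner sums
# over measurement index i run over the measurement-space vectors y and
# Ax; a measurement-space decomposition assigns each processor a slice
# of i and reduces the partial sums.

def mlem_step(x, y, A):
    """x: image estimate (n), y: measured projections (m), A: m x n system matrix."""
    m, n = len(A), len(x)
    ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]  # forward projection
    s = [sum(A[i][j] for i in range(m)) for j in range(n)]          # sensitivity image
    return [x[j] / s[j] * sum(A[i][j] * y[i] / ax[i] for i in range(m))
            for j in range(n)]

# with an identity system matrix, one step reproduces the data exactly
x1 = mlem_step([1.0, 1.0], [2.0, 3.0], [[1.0, 0.0], [0.0, 1.0]])
```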
Citations: 39
Implementing MM5 on NASA Goddard Space Flight Center computing systems: a performance study
J. Dorband, J. Kouatchou, J. Michalakes, U. Ranawake
We analyze and test the performance of the fifth-generation PSU/NCAR mesoscale model MM5 on parallel computers at NASA Goddard Space Flight Center. We show how MM5 code scales on the Cray J90, the Cray T3E and a cluster of PCs. More precisely, we are interested in finding the elapsed time, load balancing, speedup, number of floating point operations per second, and performance versus cost. Results obtained with two test problems show the efficiency of MM5 on the above computers especially with large size problems.
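The scaling quantities studied reduce to simple ratios; the definitions can be written down directly (the timings below are hypothetical, purely to show the definitions, not measurements from the paper):

```python
def speedup(t1, tp):
    """Serial elapsed time over parallel elapsed time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Speedup per processor on p processors (1.0 = ideal scaling)."""
    return speedup(t1, tp) / p

# hypothetical timings, only to illustrate the definitions
s = speedup(100.0, 12.5)         # 8.0
e = efficiency(100.0, 12.5, 16)  # 0.5
```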
Citations: 3
Optimization of a parallel pseudospectral MHD code 一个并行伪谱MHD代码的优化
A. Dubey, T. Clune
In this article we outline some techniques for optimizing spectral codes using multidimensional real-to-complex FFTs. We have successfully applied these techniques to a pseudospectral MHD code running on the CRAY T3E. The code uses half precision and runs up to 2.5 times faster than the version that uses the full precision CRAY SCILIB parallel FFT routines. The half precision version without these optimizations is slower, does not scale very well, and cannot support more than 128 processors. The optimized code achieved a performance of 100 Gflops on 1024 nodes of a CRAY T3E-600 at NASA Goddard Space Flight Center.
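The saving from a real-to-complex transform comes from the Hermitian symmetry of a real signal's spectrum: only n//2 + 1 of the n coefficients need to be computed and stored. A small pure-Python check using a naive DFT (illustrative only, unrelated to the paper's CRAY routines):

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# the spectrum of a real signal satisfies X[k] == conj(X[n-k]), so a
# real-to-complex transform keeps only the first n//2 + 1 coefficients
x = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]
X = dft(x)
n = len(x)
half = X[: n // 2 + 1]  # everything a real-to-complex FFT needs to store
```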
Citations: 2
Token space minimization by simulated annealing
Rafi Lohev, I. Gottlieb
We describe a heuristic solution for the minimum token space scheduling (MTSS) problem, based on simulated annealing. In MTSS, one schedules a set of tasks with precedence constraints, represented by a directed graph. The arcs in the graph represent data, or tokens, which the tasks must receive before they can be processed. MTSS seeks to minimize the maximum number of tokens extant at any time during execution, while minimizing completion time. We motivate MTSS with an application from computer architecture: maximizing the locality of data required for execution of a program by multiprocessors. Simulation results demonstrating the effectiveness of our method are presented.
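The simulated-annealing search loop such a heuristic builds on can be sketched generically. The MTSS-specific state encoding, neighborhood, and cost function are the paper's contribution and are replaced here by a toy permutation objective:

```python
import math
import random

def anneal(cost, neighbor, state, t0=1.0, cooling=0.995, steps=2000, seed=0):
    """Generic simulated annealing: always accept improvements, accept
    worse states with probability exp(-delta/T), and cool T geometrically."""
    rng = random.Random(seed)
    best, best_c = state, cost(state)
    cur_c, t = best_c, t0
    for _ in range(steps):
        cand = neighbor(state, rng)
        delta = cost(cand) - cur_c
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            state, cur_c = cand, cur_c + delta
            if cur_c < best_c:
                best, best_c = state, cur_c
        t *= cooling
    return best, best_c

def swap(s, rng):
    """Neighborhood move: exchange two positions in the schedule."""
    i, j = rng.randrange(len(s)), rng.randrange(len(s))
    s = list(s)
    s[i], s[j] = s[j], s[i]
    return tuple(s)

# toy cost (not the MTSS objective): retire heavy tokens as early as possible
cost = lambda s: sum(k * v for k, v in enumerate(s))
best, best_cost = anneal(cost, swap, (4, 3, 2, 1))
```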
Citations: 0
Superconducting processors for HTMT: issues and challenges
K. B. Theobald, G. Gao, T. Sterling
The Hybrid Technology Multi-Threading project is a long-term study of the feasibility of combining several emerging technologies to reach 1 petaFLOPS within ten years. HTMT will combine high-speed superconductor processors, semiconductor memories with built-in processors, high-speed optical interconnects, and high-density holographic storage. While there are major challenges in all aspects of this project, those in processor architecture are the focus of this paper. Fundamental differences between RSFQ circuits and conventional semiconductor circuits, including a radical jump in clock speed, make today's processor design approaches inappropriate for HTMT. Sequential instruction dispatching, even within the lowest programming unit (a strand), will lead to unacceptably high latencies, hence poor performance. We propose alternative processor designs which use fine-grain synchronizations between individual instructions in order to avoid these bottlenecks.
Citations: 5