Programming the Intel 80-core network-on-a-chip Terascale Processor
T. Mattson, R. V. D. Wijngaart, M. Frumkin
DOI: 10.1109/SC.2008.5213921
Intel's 80-core terascale processor was the first generally programmable microprocessor to break the teraflops barrier. The primary goal for the chip was to study power management and on-die communication technologies. When announced in 2007, it received a great deal of attention for running a stencil kernel at 1.0 single-precision TFLOPS while using only 97 watts. The literature about the chip, however, focused on the hardware, saying little about the software environment or the kernels used to evaluate the chip. This paper completes the literature on the 80-core terascale processor by fully defining the chip's software environment. We describe the instruction set, the programming environment, the kernels written for the chip, and our experiences programming this microprocessor. We close by discussing the lessons learned from this project and what they imply for future message-passing, network-on-a-chip processors.
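For readers unfamiliar with stencil codes, the following minimal Python/NumPy sketch shows the shape of a Jacobi-style 5-point stencil sweep. It is purely illustrative and assumed, not the chip's actual kernel, which was hand-coded for the processor's own instruction set and on-die message-passing network.

    import numpy as np

    def stencil_step(grid):
        # One Jacobi-style 5-point stencil sweep over the interior of a 2D
        # grid: each point is replaced by the average of its four neighbors.
        # Illustrative sketch only; the real kernel partitioned the grid
        # across cores and exchanged halo rows over the on-die mesh.
        new = grid.copy()
        new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                                  grid[1:-1, :-2] + grid[1:-1, 2:])
        return new
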
{"title":"Programming the Intel 80-core network-on-a-chip Terascale Processor","authors":"T. Mattson, R. V. D. Wijngaart, M. Frumkin","doi":"10.1109/SC.2008.5213921","DOIUrl":"https://doi.org/10.1109/SC.2008.5213921","url":null,"abstract":"Intel's 80-core terascale processor was the first generally programmable microprocessor to break the Teraflops barrier. The primary goal for the chip was to study power management and on-die communication technologies. When announced in 2007, it received a great deal of attention for running a stencil kernel at 1.0 single precision TFLOPS while using only 97 Watts. The literature about the chip, however, focused on the hardware, saying little about the software environment or the kernels used to evaluate the chip. This paper completes the literature on the 80-core terascale processor by fully defining the chip's software environment. We describe the instruction set, the programming environment, the kernels written for the chip, and our experiences programming this microprocessor. We close by discussing the lessons learned from this project and what it implies for future message passing, network-on-a-chip processors.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123852447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High performance discrete Fourier transforms on graphics processors
N. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton J. Smith, John Manferdelli
DOI: 10.1109/SC.2008.5213922
We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed-radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory-transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed-radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4x over CUFFT and 8-40x over MKL for large sizes.
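To make the Stockham formulation concrete, here is a hedged, scalar Python/NumPy sketch of a radix-2 Stockham autosort FFT. It captures the self-sorting data flow (no bit-reversal pass) that the paper exploits in GPU shared memory; the function name is an assumption, and the paper's kernels are hierarchical, mixed-radix CUDA implementations rather than this unoptimized reference.

    import numpy as np

    def stockham_fft(x):
        # Radix-2 Stockham autosort FFT: each stage writes its butterfly
        # outputs to a scratch buffer already in final order, so no
        # bit-reversal pass is needed. Power-of-two lengths only.
        a = np.asarray(x, dtype=complex).copy()
        n = a.size
        assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
        b = np.empty_like(a)
        half, width = n // 2, 1          # butterfly groups, group size
        while half >= 1:
            for j in range(half):
                w = np.exp(-2j * np.pi * j / (2 * half))   # stage twiddle
                c0 = a[j * width:(j + 1) * width]
                c1 = a[(j + half) * width:(j + half + 1) * width]
                b[2 * j * width:(2 * j + 1) * width] = c0 + c1
                b[(2 * j + 1) * width:(2 * j + 2) * width] = w * (c0 - c1)
            a, b = b, a                  # ping-pong the two buffers
            half, width = half // 2, width * 2
        return a

For any power-of-two input v, np.allclose(stockham_fft(v), np.fft.fft(v)) holds; the self-sorting property is what lets a GPU version keep each stage's butterflies in shared memory without a separate reordering pass.
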
New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors
G. Alvarez, M. Summers, Don E. Maxwell, M. Eisenbach, J. Meredith, J. Larkin, J. Levesque, T. Maier, P. Kent, E. D'Azevedo, T. Schulthess
DOI: 10.1109/SC.2008.5218119
Staggering computational and algorithmic advances in recent years now make possible systematic quantum Monte Carlo (QMC) simulations of high-temperature (high-Tc) superconductivity in a microscopic model, the two-dimensional (2D) Hubbard model, with parameters relevant to the cuprate materials. Here we report the algorithmic and computational advances that enable us to study the effect of disorder and nanoscale inhomogeneities on pair formation and the superconducting transition temperature, a step necessary to understand real materials. The simulation code is written with a generic and extensible approach and is tuned to perform well at scale. Significant algorithmic improvements have been made to make effective use of current supercomputing architectures. By implementing delayed Monte Carlo updates and a mixed single-/double-precision mode, we are able to dramatically increase the efficiency of the code. On the Cray XT4 systems at Oak Ridge National Laboratory (ORNL), for example, we currently run production jobs on 31,000 processors and thereby routinely achieve sustained performance exceeding 200 TFlop/s. On a system with 49,000 processors we achieved a sustained performance of 409 TFlop/s. We present a study of how random disorder in the effective Coulomb interaction strength affects the superconducting transition temperature in the Hubbard model.
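To illustrate the delayed-update idea named above, here is a hedged NumPy sketch under stated assumptions: rank-1 updates G += u v^T are queued and flushed as a single rank-k matrix product, trading many memory-bound BLAS-2 operations for one compute-bound BLAS-3 call. The function name and the independent-update simplification are mine; the production QMC code chains its Green's-function updates through Sherman-Morrison-style corrections, and its mixed-precision mode is only mirrored here by the float32 multiply with a float64 accumulator.

    import numpy as np

    def apply_delayed_updates(G, updates, delay=32):
        # Queue rank-1 updates and flush them as one rank-k GEMM per
        # `delay` updates. Hypothetical sketch of the blocking idea only.
        U, V = [], []
        for u, v in updates:
            U.append(u)
            V.append(v)
            if len(U) == delay:
                # one BLAS-3 call replaces `delay` BLAS-2 calls; doing the
                # product in float32 while accumulating into the float64
                # matrix loosely mirrors the paper's mixed-precision mode
                G += (np.column_stack(U).astype(np.float32)
                      @ np.vstack(V).astype(np.float32)).astype(np.float64)
                U, V = [], []
        if U:  # flush any remaining queued updates
            G += np.column_stack(U) @ np.vstack(V)
        return G
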
{"title":"New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors","authors":"G. Alvarez, M. Summers, Don E. Maxwell, M. Eisenbach, J. Meredith, J. Larkin, J. Levesque, T. Maier, P. Kent, E. D'Azevedo, T. Schulthess","doi":"10.1109/SC.2008.5218119","DOIUrl":"https://doi.org/10.1109/SC.2008.5218119","url":null,"abstract":"Staggering computational and algorithmic advances in recent years now make possible systematic Quantum Monte Carlo (QMC) simulations of high temperature (high-Tc) superconductivity in a microscopic model, the two dimensional (2D) Hubbard model, with parameters relevant to the cuprate materials. Here we report the algorithmic and computational advances that enable us to study the effect of disorder and nano-scale inhomogeneities on the pair-formation and the superconducting transition temperature necessary to understand real materials. The simulation code is written with a generic and extensible approach and is tuned to perform well at scale. Significant algorithmic improvements have been made to make effective use of current supercomputing architectures. By implementing delayed Monte Carlo updates and a mixed single-/double precision mode, we are able to dramatically increase the efficiency of the code. On the Cray XT4 systems of the Oak Ridge National Laboratory (ORNL), for example, we currently run production jobs on 31 thousand processors and thereby routinely achieve a sustained performance that exceeds 200 TFlop/s. On a system with 49 thousand processors we achieved a sustained performance of 409 TFlop/s. We present a study of how random disorder in the effective Coulomb interaction strength affects the superconducting transition temperature in the Hubbard model.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129272129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward loosely coupled programming on petascale systems
I. Raicu, Zhao Zhang, M. Wilde, Ian T Foster, P. Beckman, K. Iskra, Ben Clifford
DOI: 10.1109/SC.2008.5219768
We have extended the Falkon lightweight task execution framework to make loosely coupled programming on petascale systems a practical and useful programming model. This work studies and measures the performance factors involved in applying this approach to enable the use of petascale systems by a broader user community, and with greater ease. Our work enables the execution of highly parallel computations composed of loosely coupled serial jobs with no modifications to the respective applications. This approach allows a new, and potentially far larger, class of applications to leverage petascale systems such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O performance encountered in making this model practical, and show results using both microbenchmarks and real applications from two domains: economic energy modeling and molecular dynamics. Our benchmarks show that we can scale up to 160K processor cores with high efficiency, and can achieve sustained execution rates of thousands of tasks per second.
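The execution model is easy to picture with a toy dispatcher: a pool of workers pulls unmodified serial commands and runs them as fast as they can be handed out. The Python sketch below is a stand-in under stated assumptions (a single-node thread pool, hypothetical helper names); Falkon itself is a distributed service whose dispatcher and per-node executors run across the machine itself.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def run_task(cmd):
        # each task is an unmodified serial program, as in the model above;
        # `cmd` is an argument list such as ["./app", "input.dat"]
        return subprocess.run(cmd, capture_output=True, text=True).returncode

    def dispatch(commands, workers=64):
        # Toy many-task dispatcher (hypothetical names): submit every
        # independent job to a worker pool and collect exit codes in order.
        # Falkon plays this role across an entire petascale machine.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(run_task, commands))
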