2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis: Latest Papers

Programming the Intel 80-core network-on-a-chip Terascale Processor
T. Mattson, R. V. D. Wijngaart, M. Frumkin
Intel's 80-core terascale processor was the first generally programmable microprocessor to break the Teraflops barrier. The primary goal for the chip was to study power management and on-die communication technologies. When announced in 2007, it received a great deal of attention for running a stencil kernel at 1.0 single precision TFLOPS while using only 97 Watts. The literature about the chip, however, focused on the hardware, saying little about the software environment or the kernels used to evaluate the chip. This paper completes the literature on the 80-core terascale processor by fully defining the chip's software environment. We describe the instruction set, the programming environment, the kernels written for the chip, and our experiences programming this microprocessor. We close by discussing the lessons learned from this project and what it implies for future message passing, network-on-a-chip processors.
DOI: 10.1109/SC.2008.5213921 (https://doi.org/10.1109/SC.2008.5213921)
Published: 2008-11-15
Citations: 110
High performance discrete Fourier transforms on graphics processors
N. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton J. Smith, John Manferdelli
We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4 times over CUFFT and 8-40 times over MKL for large sizes.
DOI: 10.1109/SC.2008.5213922 (https://doi.org/10.1109/SC.2008.5213922)
Published: 2008-11-15
Citations: 315
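The Stockham formulation named in the abstract above is an autosort FFT: each pass reads from one buffer and writes to another in an order that leaves the final result naturally ordered, so no bit-reversal step is needed. The following is a minimal pure-Python sketch of a radix-2 Stockham FFT for power-of-two lengths. It is not the authors' CUDA implementation; the function name `stockham_fft` is an illustrative assumption, and shared-memory tiling, mixed radices, and Bluestein's algorithm are all left out.

```python
import cmath


def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (forward transform, unnormalized).

    Illustrative sketch only: assumes len(x) is a power of two.
    """
    n_total = len(x)
    if n_total == 0 or n_total & (n_total - 1):
        raise ValueError("length must be a power of two")

    src = [complex(v) for v in x]   # buffer read in the current pass
    dst = [0j] * n_total            # buffer written in the current pass
    n, s = n_total, 1               # n: sub-transform length, s: stride

    while n > 1:
        m = n // 2
        for p in range(m):
            w = cmath.exp(-2j * cmath.pi * p / n)  # twiddle factor
            for q in range(s):
                a = src[q + s * p]
                b = src[q + s * (p + m)]
                # Butterfly; the output index pattern performs the reordering.
                dst[q + s * (2 * p)] = a + b
                dst[q + s * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src         # ping-pong the two buffers
        n, s = m, 2 * s

    return src
```

Calling `stockham_fft([1, 0, 0, 0])` returns four values, each equal to 1, as expected for a unit impulse. The two-buffer ping-pong is the property that makes this formulation a natural fit for staging data in a fast on-chip memory, although the sketch itself makes no attempt to model that.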
New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors
G. Alvarez, M. Summers, Don E. Maxwell, M. Eisenbach, J. Meredith, J. Larkin, J. Levesque, T. Maier, P. Kent, E. D'Azevedo, T. Schulthess
Staggering computational and algorithmic advances in recent years now make possible systematic Quantum Monte Carlo (QMC) simulations of high-temperature (high-Tc) superconductivity in a microscopic model, the two-dimensional (2D) Hubbard model, with parameters relevant to the cuprate materials. Here we report the algorithmic and computational advances that enable us to study the effect of disorder and nano-scale inhomogeneities on the pair-formation and the superconducting transition temperature necessary to understand real materials. The simulation code is written with a generic and extensible approach and is tuned to perform well at scale. Significant algorithmic improvements have been made to make effective use of current supercomputing architectures. By implementing delayed Monte Carlo updates and a mixed single-/double-precision mode, we are able to dramatically increase the efficiency of the code. On the Cray XT4 systems of the Oak Ridge National Laboratory (ORNL), for example, we currently run production jobs on 31 thousand processors and thereby routinely achieve a sustained performance that exceeds 200 TFlop/s. On a system with 49 thousand processors we achieved a sustained performance of 409 TFlop/s. We present a study of how random disorder in the effective Coulomb interaction strength affects the superconducting transition temperature in the Hubbard model.
DOI: 10.1109/SC.2008.5218119 (https://doi.org/10.1109/SC.2008.5218119)
Published: 2008-11-15
Citations: 20
Toward loosely coupled programming on petascale systems
I. Raicu, Zhao Zhang, M. Wilde, Ian T Foster, P. Beckman, K. Iskra, Ben Clifford
We have extended the Falkon lightweight task execution framework to make loosely coupled programming on petascale systems a practical and useful programming model. This work studies and measures the performance factors involved in applying this approach to enable the use of petascale systems by a broader user community, and with greater ease. Our work enables the execution of highly parallel computations composed of loosely coupled serial jobs with no modifications to the respective applications. This approach allows a new, and potentially far larger, class of applications to leverage petascale systems, such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O performance encountered in making this model practical, and show results using both microbenchmarks and real applications from two domains: economic energy modeling and molecular dynamics. Our benchmarks show that we can scale up to 160K processor cores with high efficiency, and can achieve sustained execution rates of thousands of tasks per second.
DOI: 10.1109/SC.2008.5219768 (https://doi.org/10.1109/SC.2008.5219768)
Published: 2008-08-26
Citations: 146
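The loosely coupled model described in the abstract above runs many independent, unmodified serial programs side by side and collects their results. As a rough single-machine illustration of that idea only, the sketch below dispatches serial jobs through a small worker pool using the Python standard library. It does not use Falkon or reflect its API; the helper names `run_serial_job` and `run_many`, the worker count, and the toy jobs are all assumptions made for the example.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_serial_job(args):
    """Run one unmodified serial program; return (exit code, stdout)."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


def run_many(jobs, workers=4):
    """Dispatch independent jobs to a worker pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_serial_job, jobs))


# Toy workload (hypothetical): eight independent serial "applications",
# each a separate interpreter process that prints one number.
jobs = [[sys.executable, "-c", f"print({i} * {i})"] for i in range(8)]
results = run_many(jobs)
```

Each job shares nothing with the others, so the only coordination is handing out work and gathering output. At the scale the abstract reports, the dispatcher and the file system become the limiting factors, which this sketch does not attempt to address.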